├── README.md
├── Task1-Text Classification (LR)
├── data
│   ├── sampleSubmission.csv
│   ├── test.tsv
│   └── train.tsv
├── models.py
├── numpy_based.py
└── sklearn_based.py
├── Task2-Text Classification (RNN&CNN)
├── data
│   ├── sampleSubmission.csv
│   ├── test.tsv
│   └── train.tsv
├── models.py
├── rnn.py
├── run_raw.py
└── run_torchtext.py
├── Task3-Natural Language Inference
├── models.py
├── run.py
└── util.py
├── Task4-Named Entity Recognization
├── bio2bioes.py
├── conlleval.pl
├── crf.py
├── data
│   ├── dev.txt
│   ├── test.txt
│   └── train.txt
├── models.py
├── run.py
└── util.py
├── Task5-Language Model
├── data
│   └── poetryFromTang.txt
├── models.py
├── run.py
└── util.py
└── pics
    └── ESIM.jpg
/README.md:
--------------------------------------------------------------------------------
1 | 
2 | This repo implements several of the [nlp-beginner](https://github.com/FudanNLP/nlp-beginner) tasks, partly as practice for myself and partly as a reference for people who are just getting started with NLP. My knowledge is limited and there are bound to be problems; feel free to open an issue or contact me by email with any questions. Many thanks~
3 | 
4 | # Task 1: Text Classification Based on Machine Learning
5 | Dataset: [Classify the sentiment of sentences from the Rotten Tomatoes dataset](https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews)
6 | 
7 | 1. `sklearn` implementation: `n-gram` features are extracted directly with `sklearn`'s built-in `CountVectorizer`.
8 | 2. `numpy` implementation:
9 |     - a custom `n-gram` feature extraction class that stores the sparse `doc-ngram` matrix as a `scipy` `csr_matrix`;
10 |     - `numpy` implementations of `Logistic Regression` for binary classification and `Softmax Regression` for multi-class classification;
11 |     - three gradient update schemes: `BGD`, `SGD` and `MBGD`;
12 | 
13 | |Method|Parameters|Accuracy|
14 | |-----|-----|------|
15 | |LR (sklearn)|C=0.8; penalty='l1'|0.587|
16 | |SoftmaxRegression (numpy)|C=0.8; penalty='l1'|0.548|
17 | 
18 | 
19 | # Task 2: Text Classification Based on Deep Learning
20 | Dataset: [Classify the sentiment of sentences from the Rotten Tomatoes dataset](https://www.kaggle.com/c/sentiment-analysis-on-movie-reviews)
21 | 
22 | 1. Do everything step by step by hand: tokenization, vocabulary construction, data vectorization, loading word vectors, and so on.
23 | 2. Implement `RNN`, `GRU`, `LSTM` and `TextCNN` in `pytorch`; the hand-written `RNN`, `GRU` and `LSTM` models have been tested to produce the same results as `pytorch`'s built-in `nn.xxx` modules.
24 | 3. Use `torchtext` to simplify the data processing.
25 | 
26 | |Method|Parameters|Accuracy|
27 | |-----|-----|------|
28 | |RNN|epoch=5; hidden_size = 256; num_layers = 1; bidirectional = True; random embedding|0.629|
29 | |RNN|epoch=5; hidden_size = 256; num_layers = 1; bidirectional = True; glove_200 embedding|0.633|
30 | |CNN|epoch=5; num_filters = 200; kernel_sizes = [2, 3, 4]; random embedding|0.654|
31 | |CNN|epoch=5; num_filters = 200; kernel_sizes = [2, 3, 4]; glove_200 embedding|0.660|
32 | 
33 | 
34 | Note: in this experiment the `glove` vectors bring only a small improvement; the fourth setting (CNN + glove_200) works best.
35 | 
36 | # Task 3: Text Matching with Attention
37 | Dataset: [SNLI](https://nlp.stanford.edu/projects/snli/)
38 | 
39 | ![esim](pics/ESIM.jpg)
40 | 
41 | 1. Implements the [`ESIM`](https://arxiv.org/pdf/1609.06038v3.pdf) model shown on the left of the figure above. From bottom to top it consists of the following layers:
42 |     - The first layer re-encodes every word vector of a sentence with a `BiLSTM`, so that each position carries global context.
43 |     - The second layer uses `Attention` to extract the relation between premise and hypothesis and then re-composes them. Taking the premise as an example:
44 |     ```python
45 |     # x1 is the premise, x2 the hypothesis; new_embed1 is x1 after the BiLSTM, and weight2 is the normalized relevance (attention) of each word of x2 with respect to x1.
46 |     # 1. Weighted sum over the hypothesis x2; this extracts the part of x2 that is relevant to x1.
47 |     x1_align = torch.matmul(weight2, x2)
48 |     # 2. Concatenate the four parts; the subtraction and element-wise product model the "interaction" between premise and hypothesis, which the paper says makes local information such as contradictions more salient.
49 |     x1_combined = torch.cat([new_embed1, x1_align, new_embed1 - x1_align, new_embed1 * x1_align], dim=-1)
50 |     ```
51 |     - The third layer feeds the re-composed premise and hypothesis into another `BiLSTM`; to control model complexity, the paper first passes them through a single-layer `FFN` with `ReLU`.
52 |     - The fourth layer is the output layer: the premise and the hypothesis are each pooled with both `max-pooling` and `average-pooling`, the results are concatenated, and an MLP produces the prediction (see the sketch after the table below).
53 | 
54 | 2. Note the way `torchtext` reads `json` files and the way the parse-tree data is read. `torchtext` actually ships a
55 | `dataset.nli` module that implements `nli` data loading, but the code here still uses plain `FIELD`s, which makes the internal processing easier to follow.
56 | 
57 | |model test accuracy|paper test accuracy|
58 | |-----|-----|
59 | |0.86|0.88|
60 | 
61 | 
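To make the third and fourth layers above concrete, here is a minimal, illustrative sketch of the inference-composition and pooling step (the actual implementation is in `Task3-Natural Language Inference/models.py`; `composition_lstm`, `fc` and `classifier` are placeholder names, and the masking of padded positions is omitted here):

```python
import torch
import torch.nn.functional as F

def compose_and_pool(x_combined, composition_lstm, fc):
    # single-layer FFN + ReLU maps 4*hidden back to hidden to keep the model small
    x = F.relu(fc(x_combined))                # (batch, seq_len, hidden)
    x, _ = composition_lstm(x)                # second BiLSTM over the enhanced sequence
    v_avg = x.mean(dim=1)                     # average-pooling over time -> (batch, hidden)
    v_max, _ = x.max(dim=1)                   # max-pooling over time     -> (batch, hidden)
    return torch.cat([v_avg, v_max], dim=-1)  # (batch, 2*hidden)

# v1 = compose_and_pool(x1_combined, composition_lstm, fc)   # premise
# v2 = compose_and_pool(x2_combined, composition_lstm, fc)   # hypothesis
# logits = classifier(torch.cat([v1, v2], dim=-1))           # final MLP over [v1; v2]
```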
62 | # Task 4: Sequence Labeling with LSTM+CRF
63 | Dataset: [CONLL 2003](https://www.clips.uantwerpen.be/conll2003/ner/)
64 | 
65 | Uses a `CNN+BiLSTM+CRF` architecture, reproducing the paper [`End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF`](https://arxiv.org/pdf/1603.01354.pdf).
66 | 
67 | - The data uses the `BIOES` tagging scheme; `bio2bioes.py` converts it from BIO.
68 | - Preprocessing: all digits are mapped to 0.
69 | - When loading the pretrained `embedding`, remember to do fuzzy (case-insensitive) matching.
70 | - When building the `vocab`, use the words from `train` plus those words from `dev`+`test` that appear in the `embedding`.
71 | - Initialization is described in detail in the paper; in particular, the `bias` of the `LSTM` `forget gate` is initialized to 1, because the memory is not reliable at the beginning of training (see the sketch at the end of this section).
72 | - `dropout` is applied to both the input and the output of the `LSTM`, and to the character `embedding`s before they enter the `CNN`.
73 | - For the optimizer, the paper uses `SGD` and adjusts the learning rate every epoch. Note that the `weight decay` option of `pytorch`'s `SGD` cannot be used for this: it decays the weights (L2 regularization), whereas what we need here is to decay the learning rate.
74 | 
75 | |embedding|entity-level F1|paper result|
76 | |----|------|---|
77 | |random (uniform)|83.18|80.76|
78 | |glove 100|90.7|91.21|
79 | 
80 | 
81 | Note on `torchtext`: since `torchtext` can only handle text in `tsv/csv/json` format, the sentences here have to be read from the raw
82 | files by hand and wrapped in a custom `Dataset`. When building the `Example`s, the text can take two forms:
83 | - a `list`, in which case the `tokenizer` inside the `FIELD` is not applied;
84 | - a `str` with `word`s separated by spaces, in which case the `FIELD` must be built with `tokenizer=lambda x: x.split()` (for this task) to override the internal `tokenizer`;
85 | otherwise the internal tokenizer is applied, the resulting `word`s may differ, and they no longer line up with the `label`s.
86 | 
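As a concrete illustration of the forget-gate bias initialization mentioned above, here is a small sketch for `pytorch`'s `nn.LSTM` (the sizes are illustrative and this is not the exact initialization code used in this task's `models.py`). PyTorch packs the four gate biases as `[b_i | b_f | b_g | b_o]`, so the forget-gate slice is `[hidden_size:2*hidden_size]`:

```python
import torch.nn as nn

hidden_size = 200                                  # illustrative value
lstm = nn.LSTM(input_size=130, hidden_size=hidden_size,
               batch_first=True, bidirectional=True)

for name, param in lstm.named_parameters():
    if 'bias' in name:                             # bias_ih_l0, bias_hh_l0, *_reverse, ...
        nn.init.constant_(param, 0.)
        # set the forget-gate slice to 1 (doing this for both bias_ih and bias_hh
        # gives an effective forget bias of 2; initializing only one of them is also common)
        param.data[hidden_size:2 * hidden_size] = 1.
    else:                                          # weight_ih_* / weight_hh_*
        nn.init.xavier_uniform_(param)             # one common choice; the paper specifies its own scheme
```

On the learning-rate point in the same list: `weight_decay` in `torch.optim.SGD` is L2 regularization, so the per-epoch learning-rate decay has to be done separately, e.g. with a scheduler or by updating each `param_group['lr']` by hand.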
87 | # Task 5: Language Modeling with Neural Networks
88 | 
89 | Use a `CharRNN` to write (Tang) poems. The evaluation metric is perplexity, i.e. the `exp` of the cross-entropy loss.
90 | 1. A unidirectional `LSTM` is used; the last input token is `[EOS]`, and the `target` is the input offset by one position (each position predicts the next character).
91 | 2. The poem length is capped at `MAX_LEN`; splitting is done on line boundaries, i.e. an over-long poem is cut into several shorter ones.
92 | 3. After training for 200 `epoch`s the perplexity is around 400.
93 | >Input the first word or press Ctrl-C to exit: 鸟
94 | >
95 | >鸟渠霭乌秋,游王居信知。鹏未弟休不,深沙由意。寥五将不,两迹悄臣。生微心日,水复师尘。来称簸更,影乏魍无。
--------------------------------------------------------------------------------
/Task1-Text Classification (LR)/models.py:
--------------------------------------------------------------------------------
1 | import numpy as np
2 | 
3 | 
4 | class LogisticRegression():
5 |     ''' Only for two classes classification.'''
6 | 
7 |     def __init__(self, num_features, learning_rate=0.01, regularization=None, C=1):
8 |         self.w = np.random.uniform(size=num_features)
9 |         self.learning_rate = learning_rate
10 |         self.num_features = num_features
11 |         self.regularization = regularization
12 |         self.C = C
13 | 
14 |     def _exp_dot(self, x):
15 |         return np.exp(x.dot(self.w))
16 | 
17 |     def predict(self, x):
18 |         '''
19 |         Return the predicted classes.
20 |         :param x: (batch_size, num_features)
21 |         :return: (batch_size)
22 |         '''
23 |         probs = sigmoid(x.dot(self.w))  # p(y=1|x)
24 |         return (probs > 0.5).astype(int)
25 | 
26 |     def gd(self, x, y):
27 |         '''
28 |         Perform one gradient descent.
29 |         :param x:(batch_size, num_features)
30 |         :param y:(batch_size)
31 |         :return: None
32 |         '''
33 |         probs = sigmoid(x.dot(self.w))  # p(y=1|x)
34 |         gradients = (x.multiply((y - probs).reshape(-1, 1))).sum(0)
35 |         gradients = np.array(gradients.tolist()).reshape(self.num_features)
36 |         if self.regularization == "l2":
37 |             self.w += self.learning_rate * (gradients * self.C - self.w)
38 |         elif self.regularization == "l1":
39 |             self.w += self.learning_rate * (gradients * self.C - np.sign(self.w))
40 |         else:
41 |             self.w += self.learning_rate * gradients
42 | 
43 |     def mle(self, x, y):
44 |         ''' Return the MLE estimates, log[p(y|x)]'''
45 |         return (y * x.dot(self.w) - np.log(1 + self._exp_dot(x))).sum()
46 | 
47 | 
48 | def sigmoid(x):
49 |     return 1 / (1 + np.exp(-x))
50 | 
51 | 
52 | class SoftmaxRegression():
53 |     ''' Multi-classes classification.'''
54 | 
55 |     def __init__(self, num_features, num_classes, learning_rate=0.01, regularization=None, C=1):
56 |         self.w = np.random.uniform(size=(num_features, num_classes))
57 |         self.learning_rate = learning_rate
58 |         self.num_features = num_features
59 |         self.num_classes = num_classes
60 |         self.regularization = regularization
61 |         self.C = C
62 | 
63 |     def predict(self, x):
64 |         '''
65 |         Return the predicted classes.
66 |         :param x: (batch_size, num_features)
67 |         :return: (batch_size)
68 |         '''
69 |         probs = softmax(x.dot(self.w))
70 |         return probs.argmax(-1)
71 | 
72 |     def gd(self, x, y):
73 |         '''
74 |         Perform one gradient descent.
75 |         :param x:(batch_size, num_features)
76 |         :param y:(batch_size)
77 |         :return: None
78 |         '''
79 |         probs = softmax(x.dot(self.w))
80 |         gradients = x.transpose().dot(to_onehot(y, self.num_classes) - probs)
81 |         if self.regularization == "l2":
82 |             self.w += self.learning_rate * (gradients * self.C - self.w)
83 |         elif self.regularization == "l1":
84 |             self.w += self.learning_rate * (gradients * self.C - np.sign(self.w))
85 |         else:
86 |             self.w += self.learning_rate * gradients
87 | 
88 |     def mle(self, x, y):
89 |         '''
90 |         Perform the MLE estimation.
91 | :param x: (batch_size, num_features) 92 | :param y: (batch_size) 93 | :return: scalar 94 | ''' 95 | probs = softmax(x.dot(self.w)) 96 | return (to_onehot(y, self.num_classes) * np.log(probs)).sum() 97 | 98 | 99 | def softmax(x): 100 | return np.exp(x) / np.exp(x).sum(-1, keepdims=True) 101 | 102 | 103 | def to_onehot(x, class_num): 104 | return np.eye(class_num)[x] 105 | -------------------------------------------------------------------------------- /Task1-Text Classification (LR)/numpy_based.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | import os 3 | from collections import Counter 4 | from scipy.sparse import csr_matrix 5 | import numpy as np 6 | from models import LogisticRegression, SoftmaxRegression 7 | 8 | train_epochs = 10000 9 | learning_rate = 0.00005 10 | batch_size = 1024 11 | class_num = 5 12 | data_path = "data" 13 | regularization = "l1" 14 | C = 0.8 15 | 16 | 17 | class Ngram(): 18 | def __init__(self, n_grams, max_tf=0.8): 19 | ''' n_grams: tuple, n_gram range''' 20 | self.n_grams = n_grams 21 | self.tok2id = {} 22 | self.tok2tf = Counter() 23 | self.max_tf = max_tf 24 | 25 | @staticmethod 26 | def tokenize(text): 27 | ''' In this task, we simply the following tokenizer.''' 28 | return text.lower().split(" ") 29 | 30 | def get_n_grams(self, toks): 31 | ngrams_toks = [] 32 | for ngrams in range(self.n_grams[0], self.n_grams[1] + 1): 33 | for i in range(0, len(toks) - ngrams + 1): 34 | ngrams_toks.append(' '.join(toks[i:i + ngrams])) 35 | return ngrams_toks 36 | 37 | def fit(self, datas, fix_vocab=False): 38 | ''' Transform the data into n-gram vectors. Using csr_matrix to store this sparse matrix.''' 39 | if not fix_vocab: 40 | for data in datas: 41 | toks = self.tokenize(data) 42 | ngrams_toks = self.get_n_grams(toks) 43 | self.tok2tf.update(Counter(ngrams_toks)) 44 | self.tok2tf = dict(filter(lambda x: x[1] < self.max_tf * len(datas), self.tok2tf.items())) 45 | self.tok2id = dict([(k, i) for i, k in enumerate(self.tok2tf.keys())]) 46 | # the column indices for row i are stored in indices[indptr[i]:indptr[i+1]] 47 | # and their corresponding values are stored in nums[indptr[i]:indptr[i+1]] 48 | indices = [] 49 | indptr = [0] 50 | nums = [] 51 | for data in datas: 52 | toks = self.tokenize(data) 53 | ngrams_counter = Counter(self.get_n_grams(toks)) 54 | for k, v in ngrams_counter.items(): 55 | if k in self.tok2id: 56 | indices.append(self.tok2id[k]) 57 | nums.append(v) 58 | indptr.append(len(indices)) 59 | return csr_matrix((nums, indices, indptr), dtype=int, shape=(len(datas), len(self.tok2id))) 60 | 61 | 62 | def train_test_split(X, Y, shuffle=True): 63 | ''' 64 | Split data into train set, dev set and test set. 65 | ''' 66 | assert X.shape[0] == Y.shape[0], "The length of X and Y must be equal." 67 | len_ = X.shape[0] 68 | index = np.arange(0, len_) 69 | if shuffle: 70 | np.random.shuffle(index) 71 | train_num = int(0.8 * len_) 72 | dev_num = int(0.1 * len_) 73 | test_num = len_ - train_num - dev_num 74 | return X[index[:train_num]], X[index[train_num:train_num + dev_num]], X[index[-test_num:]], \ 75 | Y[index[:train_num]], Y[index[train_num:train_num + dev_num]], Y[index[-test_num:]] 76 | 77 | 78 | def minibatch(data, minibatch_idx): 79 | return data[minibatch_idx] if type(data) in [np.ndarray, csr_matrix] else [data[i] for i in minibatch_idx] 80 | 81 | 82 | def get_minibatches(data, minibatch_size, shuffle=True): 83 | """ 84 | Iterates through the provided data one minibatch at at time. 
You can use this function to 85 | iterate through data in minibatches as follows: 86 | 87 | for inputs_minibatch in get_minibatches(inputs, minibatch_size): 88 | ... 89 | 90 | Or with multiple data sources: 91 | 92 | for inputs_minibatch, labels_minibatch in get_minibatches([inputs, labels], minibatch_size): 93 | ... 94 | Args: 95 | data: there are two possible values: 96 | - a list or numpy array 97 | - a list where each element is either a list or numpy array 98 | minibatch_size: the maximum number of items in a minibatch 99 | shuffle: whether to randomize the order of returned data 100 | Returns: 101 | minibatches: the return value depends on data: 102 | - If data is a list/array it yields the next minibatch of data. 103 | - If data a list of lists/arrays it returns the next minibatch of each element in the 104 | list. This can be used to iterate through multiple data sources 105 | (e.g., features and labels) at the same time. 106 | """ 107 | list_data = type(data) is list and (type(data[0]) is list or type(data[0]) in [np.ndarray, csr_matrix]) 108 | if list_data: 109 | data_size = data[0].shape[0] if type(data[0]) in [np.ndarray, csr_matrix] else len(data[0]) 110 | else: 111 | data_size = data[0].shape[0] if type(data) in [np.ndarray, csr_matrix] else len(data) 112 | indices = np.arange(data_size) 113 | if shuffle: 114 | np.random.shuffle(indices) 115 | for minibatch_start in np.arange(0, data_size, minibatch_size): 116 | minibatch_indices = indices[minibatch_start:minibatch_start + minibatch_size] 117 | yield [minibatch(d, minibatch_indices) for d in data] if list_data \ 118 | else minibatch(data, minibatch_indices) 119 | 120 | 121 | if __name__ == "__main__": 122 | 123 | train = pd.read_csv(os.path.join(data_path, 'train.tsv'), sep='\t') 124 | test = pd.read_csv(os.path.join(data_path, 'test.tsv'), sep='\t') 125 | 126 | ngram = Ngram((1, 1)) 127 | X = ngram.fit(train['Phrase']) 128 | Y = train['Sentiment'].values 129 | 130 | # convert to 2 classes to test our LogisticRegression. 
131 | # Y = train['Sentiment'].apply(lambda x:1 if x>2 else 0).values 132 | # lr = LogisticRegression(X.shape[1], learning_rate, "l2") 133 | 134 | lr = SoftmaxRegression(X.shape[1], class_num, learning_rate, regularization, C) 135 | 136 | train_X, dev_X, test_X, train_Y, dev_Y, test_Y = train_test_split(X, Y) 137 | 138 | # # Method1: (batch) gradient descent 139 | # for epoch in range(train_epochs): 140 | # train_mle = lr.mle(train_X, train_Y) 141 | # print("Epoch %s, Train MLE %.3f" % (epoch, train_mle)) 142 | # lr.gd(train_X, train_Y) 143 | # predict_dev_Y = lr.predict(dev_X) 144 | # print("Epoch %s, Dev Acc %.3f" % (epoch, (predict_dev_Y == dev_Y).sum() / len(dev_Y))) 145 | 146 | # # Method2: stochastic gradient descent 147 | # for epoch in range(train_epochs): 148 | # for batch_X, batch_Y in get_minibatches([train_X, train_Y], 1, True): 149 | # lr.gd(batch_X, batch_Y) 150 | # predict_dev_Y = lr.predict(dev_X) 151 | # print("Epoch %s, Dev Acc %.3f" % (epoch, (predict_dev_Y == dev_Y).sum() / len(dev_Y))) 152 | 153 | # Method3: mini-batch gradient descent 154 | for epoch in range(train_epochs): 155 | for batch_X, batch_Y in get_minibatches([train_X, train_Y], batch_size, True): 156 | lr.gd(batch_X, batch_Y) 157 | predict_dev_Y = lr.predict(dev_X) 158 | print("Epoch %s, Dev Acc %.3f" % (epoch, (predict_dev_Y == dev_Y).sum() / len(dev_Y))) 159 | 160 | # testing 161 | predict_test_Y = lr.predict(test_X) 162 | print("Test Acc %.3f" % ((predict_test_Y == test_Y).sum() / len(test_Y))) 163 | 164 | # predicting 165 | to_predict_X = ngram.fit(test['Phrase'], fix_vocab=True) 166 | test['Sentiment'] = lr.predict(to_predict_X) 167 | test[['Sentiment', 'PhraseId']].set_index('PhraseId').to_csv('numpy_based_lr.csv') 168 | -------------------------------------------------------------------------------- /Task1-Text Classification (LR)/sklearn_based.py: -------------------------------------------------------------------------------- 1 | import pandas as pd 2 | from sklearn.feature_extraction.text import CountVectorizer 3 | from sklearn.linear_model import LogisticRegression 4 | from sklearn.pipeline import Pipeline 5 | from sklearn.model_selection import StratifiedKFold, GridSearchCV 6 | import os 7 | 8 | data_path = "data" 9 | train = pd.read_csv(os.path.join(data_path, 'train.tsv'), sep='\t') 10 | test = pd.read_csv(os.path.join(data_path, 'test.tsv'), sep='\t') 11 | 12 | ngram_vectorizer = CountVectorizer(ngram_range=(1, 1)) 13 | lr = LogisticRegression() 14 | params = {'C': [0.5, 0.8, 1.0], 'penalty': ['l1', 'l2']} 15 | skf = StratifiedKFold(n_splits=3) 16 | gsv = GridSearchCV(lr, params, cv=skf) 17 | pipeline = Pipeline([("ngram", ngram_vectorizer), 18 | ("lr", lr) 19 | ]) 20 | X = train['Phrase'] 21 | y = train['Sentiment'] 22 | pipeline.fit(X, y) 23 | print(gsv.best_score_) 24 | print(gsv.best_params_) 25 | test['Sentiment'] = pipeline.predict(test['Phrase']) 26 | test[['Sentiment', 'PhraseId']].set_index('PhraseId').to_csv('sklearn_based_lr.csv') 27 | 28 | # 0.587 29 | -------------------------------------------------------------------------------- /Task2-Text Classification (RNN&CNN)/models.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import torch.nn.functional as F 4 | import math 5 | from rnn import RNNModel 6 | 7 | class RNN(nn.Module): 8 | def __init__(self, vocab_size, embed_size, hidden_size, num_layers, num_classes, bidirectional=True, 9 | dropout_rate=0.3): 10 | super(RNN, self).__init__() 11 | 
self.hidden_size = hidden_size 12 | self.num_layers = num_layers 13 | 14 | self.embed = nn.Embedding(vocab_size, embed_size) 15 | # self.rnn = nn.RNN(embed_size, hidden_size, num_layers, batch_first=True, bidirectional=bidirectional) 16 | self.rnn = RNNModel(embed_size, hidden_size, num_layers, batch_first=True, bidirectional=bidirectional) 17 | self.bidirectional = bidirectional 18 | if not bidirectional: 19 | self.fc = nn.Linear(hidden_size, num_classes) 20 | else: 21 | self.fc = nn.Linear(hidden_size * 2, num_classes) 22 | self.dropout = nn.Dropout(dropout_rate) 23 | 24 | self.init_weights() 25 | 26 | def init_weights(self): 27 | std = 1.0 / math.sqrt(self.hidden_size) 28 | for w in self.parameters(): 29 | w.data.uniform_(-std, std) 30 | 31 | def forward(self, x, lens): 32 | embeddings = self.embed(x) 33 | output, _ = self.rnn(embeddings) 34 | # get the output specified by length 35 | real_output = output[range(len(lens)), lens - 1] # (batch_size, seq_length, hidden_size*num_directions) 36 | out = self.fc(self.dropout(real_output)) 37 | return out 38 | 39 | 40 | class CNN(nn.Module): 41 | def __init__(self, vocab_size, embed_size, num_classes, num_filters=100, kernel_sizes=[3, 4, 5], dropout_rate=0.3): 42 | super(CNN, self).__init__() 43 | self.embed = nn.Embedding(vocab_size, embed_size) 44 | self.convs = nn.ModuleList([ 45 | nn.Conv2d(1, num_filters, (k, embed_size), padding=(k - 1, 0)) 46 | for k in kernel_sizes]) 47 | self.fc = nn.Linear(len(kernel_sizes) * num_filters, num_classes) 48 | self.dropout = nn.Dropout(dropout_rate) 49 | 50 | def conv_and_pool(self, x, conv): 51 | x = F.relu(conv(x).squeeze(3)) # (batch_size, num_filter, conv_seq_length) 52 | x_max = F.max_pool1d(x, x.size(2)).squeeze(2) # (batch_size, num_filter) 53 | return x_max 54 | 55 | def forward(self, x, lens): 56 | embed = self.embed(x).unsqueeze(1) # (batch_size, 1, seq_length, embedding_dim) 57 | 58 | conv_results = [self.conv_and_pool(embed, conv) for conv in self.convs] 59 | 60 | out = torch.cat(conv_results, 1) # (batch_size, num_filter * len(kernel_sizes)) 61 | return self.fc(self.dropout(out)) 62 | -------------------------------------------------------------------------------- /Task2-Text Classification (RNN&CNN)/rnn.py: -------------------------------------------------------------------------------- 1 | #!/usr/bin/python3 2 | # -*- coding: utf-8 -*- 3 | # @Time : 2019/7/31 10:11 4 | # @Author : Jiacheng Ye 5 | 6 | import torch 7 | import torch.nn as nn 8 | import math 9 | 10 | 11 | class RNNCell(nn.Module): 12 | def __init__(self, input_size, hidden_size): 13 | super(RNNCell, self).__init__() 14 | self.input_size = input_size 15 | self.hidden_size = hidden_size 16 | self.x2h = nn.Linear(input_size, hidden_size) 17 | self.h2h = nn.Linear(hidden_size, hidden_size) 18 | self.init_weights() 19 | 20 | def init_weights(self): 21 | std = 1.0 / math.sqrt(self.hidden_size) 22 | for w in self.parameters(): 23 | w.data.uniform_(-std, std) 24 | 25 | def forward(self, x, hidden): 26 | ''' 27 | 28 | :param x: [batch_size, input_size] 29 | :param hidden: [batch_size, hidden_size] 30 | :return: h_n: [batch_size, hidden_size] 31 | ''' 32 | return torch.tanh(self.x2h(x) + self.h2h(hidden)) 33 | 34 | 35 | class GRUCell(nn.Module): 36 | def __init__(self, input_size, hidden_size): 37 | super(GRUCell, self).__init__() 38 | self.input_size = input_size 39 | self.hidden_size = hidden_size 40 | self.x2h = nn.Linear(input_size, 3 * hidden_size) 41 | self.h2h = nn.Linear(hidden_size, 3 * hidden_size) 42 | self.init_weights() 43 | 
44 | def init_weights(self): 45 | std = 1.0 / math.sqrt(self.hidden_size) 46 | for w in self.parameters(): 47 | w.data.uniform_(-std, std) 48 | 49 | def forward(self, x, hidden): 50 | ''' 51 | 52 | :param x: [batch_size, input_size] 53 | :param hidden: [batch_size, hidden_size] 54 | :return: h_n: [batch_size, hidden_size] 55 | ''' 56 | gate_x = self.x2h(x) 57 | gate_h = self.h2h(hidden) 58 | 59 | i_r, i_i, i_n = gate_x.chunk(3, 1) 60 | h_r, h_i, h_n = gate_h.chunk(3, 1) 61 | 62 | resetgate = torch.sigmoid(i_r + h_r) 63 | inputgate = torch.sigmoid(i_i + h_i) 64 | newgate = torch.tanh(i_n + (resetgate * h_n)) 65 | 66 | h_n = newgate + inputgate * (hidden - newgate) 67 | return h_n 68 | 69 | 70 | class LSTMCell(nn.Module): 71 | def __init__(self, input_size, hidden_size): 72 | super(LSTMCell, self).__init__() 73 | self.input_size = input_size 74 | self.hidden_size = hidden_size 75 | self.x2h = nn.Linear(input_size, 4 * hidden_size) 76 | self.h2h = nn.Linear(hidden_size, 4 * hidden_size) 77 | self.init_weights() 78 | 79 | def init_weights(self): 80 | std = 1.0 / math.sqrt(self.hidden_size) 81 | for w in self.parameters(): 82 | w.data.uniform_(-std, std) 83 | 84 | def forward(self, x, hidden): 85 | ''' 86 | 87 | :param x: [batch_size, input_size] 88 | :param hidden: tuple of [batch_size, hidden_size] 89 | :return: (h_n, c_n), each size is [batch_size, hidden_size] 90 | ''' 91 | hx, cx = hidden 92 | gates = self.x2h(x) + self.h2h(hx) 93 | ingate, forgetgate, cellgate, outgate = gates.chunk(4, 1) 94 | 95 | ingate = torch.sigmoid(ingate) 96 | forgetgate = torch.sigmoid(forgetgate) 97 | cellgate = torch.tanh(cellgate) 98 | outgate = torch.sigmoid(outgate) 99 | 100 | c_n = torch.mul(cx, forgetgate) + torch.mul(ingate, cellgate) 101 | h_n = torch.mul(outgate, torch.tanh(c_n)) 102 | return (h_n, c_n) 103 | 104 | 105 | class RNNModel(nn.Module): 106 | def __init__(self, input_size, hidden_size, num_layers, bidirectional=False, batch_first=True, dropout=0.5, 107 | mode="RNN"): 108 | super(RNNModel, self).__init__() 109 | 110 | self.bidirectional = bidirectional 111 | self.hidden_size = hidden_size 112 | self.num_layers = num_layers 113 | self.num_directions = num_directions = 2 if bidirectional else 1 114 | self.mode = mode 115 | self.batch_first = batch_first 116 | self.dropout = nn.Dropout(dropout) 117 | self.cells = cells = nn.ModuleList() 118 | if mode == "RNN": 119 | cell_cls = RNNCell 120 | elif mode == "GRU": 121 | cell_cls = GRUCell 122 | elif mode == "LSTM": 123 | cell_cls = LSTMCell 124 | else: 125 | raise NotImplementedError(mode + " mode not supported, choose 'RNN', 'GRU' or 'LSTM'.") 126 | for layer in range(num_layers): 127 | for direction in range(num_directions): 128 | rnn_cell = cell_cls(input_size, hidden_size) if layer == 0 else cell_cls(hidden_size * num_directions, 129 | hidden_size) 130 | cells.append(rnn_cell) 131 | 132 | def forward(self, x): 133 | ''' 134 | 135 | :param x: [batch_size, max_seq_length, input_size] if batch_first is True 136 | :return: output: [batch, seq_len, num_directions * hidden_size] if batch_first is True 137 | hidden: [num_layers * num_directions, batch, hidden_size] if mode is "RNN" or "GRU", if mode is "LSTM", 138 | hidden will be (h_n, c_n), each size is [num_layers * num_directions, batch, hidden_size]. 
139 | 140 | ''' 141 | if self.batch_first: 142 | batch_size = x.size(0) 143 | inputs = x.transpose(0, 1) 144 | else: 145 | batch_size = x.size(1) 146 | inputs = x 147 | h0 = torch.zeros(self.num_layers * self.num_directions, batch_size, self.hidden_size).to(x.device) 148 | if self.mode == 'LSTM': 149 | h0 = (h0, h0) 150 | outs = [] 151 | hiddens = [] 152 | for layer in range(self.num_layers): 153 | inputs = inputs if layer == 0 else self.dropout(outs) # [max_seq_length, batch_size, layer_input_size] 154 | layer_outs_with_directions = [] 155 | for direction in range(self.num_directions): 156 | idx = layer * self.num_directions + direction 157 | inputs = inputs if direction == 0 else inputs.flip(0) 158 | rnn_cell = self.cells[idx] 159 | if self.mode == 'LSTM': 160 | layer_hn = (h0[0][idx], h0[1][idx]) # tuple of [batch_size, hidden_size], (h0, c0) 161 | else: 162 | layer_hn = h0[idx] 163 | layer_outs = [] 164 | for time_step in range(inputs.size(0)): 165 | layer_hn = rnn_cell(inputs[time_step], layer_hn) 166 | layer_outs.append(layer_hn) 167 | if self.mode == 'LSTM': 168 | layer_outs = torch.stack([out[0] for out in layer_outs]) # [max_seq_len, batch_size, hidden_size] 169 | else: 170 | layer_outs = torch.stack(layer_outs) # [max_seq_len, batch_size, hidden_size] 171 | layer_outs_with_directions.append(layer_outs if direction == 0 else layer_outs.flip(0)) 172 | hiddens.append(layer_hn) 173 | outs = torch.cat(layer_outs_with_directions, -1) # [max_seq_len, batch_size, 2*hidden_size] 174 | 175 | if self.batch_first: 176 | output = outs.transpose(0, 1) 177 | else: 178 | output = outs 179 | if self.mode == 'LSTM': 180 | hidden = (torch.stack([h[0] for h in hiddens]), torch.stack([h[1] for h in hiddens])) 181 | else: 182 | hidden = torch.stack(hiddens) 183 | 184 | return output, hidden 185 | 186 | 187 | def test_RNN_Model(): 188 | input_size, hidden_size, num_layers, bidirectional, batch_first = 50, 100, 2, True, False 189 | dropout = 0.1 190 | 191 | x = torch.randn([20, 15, 50]) 192 | mymodel = RNNModel(input_size, hidden_size, num_layers, bidirectional=bidirectional, batch_first=batch_first, 193 | dropout=dropout, 194 | mode="RNN") 195 | for cell in mymodel.cells: 196 | for w in cell.parameters(): 197 | nn.init.constant_(w.data, 0.01) 198 | 199 | model = nn.RNN(input_size, hidden_size, num_layers, bidirectional=bidirectional, batch_first=batch_first, 200 | dropout=dropout) 201 | for w in model.parameters(): 202 | nn.init.constant_(w.data, 0.01) 203 | 204 | torch.manual_seed(1) 205 | outs, hidden = model(x) 206 | torch.manual_seed(1) 207 | myouts, myhidden = mymodel(x) 208 | 209 | assert (hidden != myhidden).sum().item() == 0, "hidden don't match, RNNcell maybe wrong!" 210 | assert (outs != myouts).sum().item() == 0, "outs don't match, RNNcell maybe wrong!" 
211 | 212 | 213 | def test_GRU_Model(): 214 | input_size, hidden_size, num_layers, bidirectional, batch_first = 50, 100, 2, True, False 215 | dropout = 0.1 216 | torch.manual_seed(1) 217 | x = torch.randn([20, 15, 50]) 218 | 219 | mymodel = RNNModel(input_size, hidden_size, num_layers, bidirectional=bidirectional, batch_first=True, 220 | dropout=dropout, 221 | mode="GRU") 222 | for cell in mymodel.cells: 223 | for w in cell.parameters(): 224 | nn.init.constant_(w.data, 0.01) 225 | 226 | model = nn.GRU(input_size, hidden_size, num_layers, bidirectional=bidirectional, batch_first=True, dropout=dropout) 227 | for w in model.parameters(): 228 | nn.init.constant_(w.data, 0.01) 229 | 230 | torch.manual_seed(1) 231 | outs, hidden = model(x) 232 | torch.manual_seed(1) 233 | myouts, myhidden = mymodel(x) 234 | 235 | assert (hidden != myhidden).sum().item() == 0, "hidden don't match, GRUcell maybe wrong!" 236 | assert (outs != myouts).sum().item() == 0, "outs don't match, GRUcell maybe wrong!" 237 | 238 | 239 | def test_LSTM_Model(): 240 | input_size, hidden_size, num_layers, bidirectional, batch_first = 50, 100, 2, True, False 241 | dropout = 0.1 242 | 243 | x = torch.randn([20, 15, 50]) 244 | 245 | mymodel = RNNModel(input_size, hidden_size, num_layers, bidirectional=bidirectional, batch_first=batch_first, 246 | dropout=dropout, 247 | mode="LSTM") 248 | for cell in mymodel.cells: 249 | for w in cell.parameters(): 250 | nn.init.constant_(w.data, 0.01) 251 | 252 | model = nn.LSTM(input_size, hidden_size, num_layers, bidirectional=bidirectional, batch_first=batch_first, 253 | dropout=dropout) 254 | for w in model.parameters(): 255 | nn.init.constant_(w.data, 0.01) 256 | 257 | torch.manual_seed(1) 258 | outs, (h_n, c_n) = model(x) 259 | torch.manual_seed(1) 260 | myouts, (myh_n, myc_n) = mymodel(x) 261 | 262 | assert (h_n != myh_n).sum().item() == 0, "h_n don't match, LSTMcell maybe wrong!" 263 | assert (c_n != myc_n).sum().item() == 0, "c_n don't match, LSTMcell maybe wrong!" 264 | assert (outs != myouts).sum().item() == 0, "outs don't match, LSTMcell maybe wrong!" 265 | 266 | 267 | def test(): 268 | test_RNN_Model() 269 | test_GRU_Model() 270 | test_LSTM_Model() 271 | -------------------------------------------------------------------------------- /Task2-Text Classification (RNN&CNN)/run_raw.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | from torch.utils.data import DataLoader 4 | from torch.utils.data import Dataset 5 | from tqdm import tqdm, trange 6 | from torch.optim import Adam 7 | from tensorboardX import SummaryWriter 8 | import pandas as pd 9 | from collections import OrderedDict, Counter 10 | import os, re 11 | 12 | from models import RNN, CNN 13 | 14 | device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 15 | 16 | train_epochs = 10 17 | batch_size = 512 18 | learning_rate = 0.001 19 | max_seq_length = 48 20 | num_classes = 5 21 | dropout_rate = 0.1 22 | data_path = "data" 23 | clip = 5 24 | embed_size = 200 25 | use_pretrained_embedding = True 26 | embed_path = '/home/yjc/embeddings/glove.6B.200d.txt' 27 | freeze = False 28 | use_rnn = True # set True to use RNN, otherwise CNN. 
29 | 30 | # parameters for RNN 31 | hidden_size = 256 32 | num_layers = 1 33 | bidirectional = True 34 | 35 | # parameters for CNN 36 | num_filters = 100 37 | kernel_sizes = [2, 3, 4] # n-gram 38 | 39 | 40 | class Tokenizer(): 41 | def __init__(self, datas, vocabulary=None): 42 | self.data_len = len(datas) 43 | if vocabulary: 44 | self.tok2id = vocabulary 45 | else: 46 | self.tok2id = self.build_dict(self.tokenize(' '.join(datas)), offset=4) 47 | 48 | self.tok2id['[PAD]'] = 0 49 | self.tok2id['[UNK]'] = 1 50 | 51 | self.id2tok = OrderedDict([(id, tok) for tok, id in self.tok2id.items()]) 52 | 53 | def build_dict(self, words, offset=0, max_words=None, max_df=None): 54 | cnt = Counter(words) 55 | if max_df: 56 | words = dict(filter(lambda x: x[1] < max_df * self.data_len, cnt.items())) 57 | words = sorted(words.items(), key=lambda x: x[1], reverse=True) 58 | if max_words: 59 | words = words[:max_words] # [(word, count)] 60 | return {word: offset + i for i, (word, _) in enumerate(words)} 61 | 62 | @staticmethod 63 | def tokenize(text): 64 | # return re.compile(r'\b\w\w+\b').findall(text) 65 | return text.split(" ") 66 | 67 | def convert_ids_to_tokens(self, ids): 68 | return [self.id2tok[i] for i in ids] 69 | 70 | def convert_tokens_to_ids(self, tokens): 71 | ids = [] 72 | for token in tokens: 73 | if not self.tok2id.get(token): 74 | ids.append(self.tok2id["[UNK]"]) 75 | else: 76 | ids.append(self.tok2id[token]) 77 | return ids 78 | 79 | 80 | class MyDataset(Dataset): 81 | def __init__(self, datas, max_seq_length, tokenizer): 82 | self.datas = datas 83 | self.tokenizer = tokenizer 84 | self.max_seq_length = max_seq_length 85 | 86 | def __len__(self): 87 | return len(self.datas) 88 | 89 | def __getitem__(self, item): 90 | toks = self.tokenizer.tokenize(self.datas['Phrase'][item].lower()) 91 | cur_example = InputExample(uid=item, toks=toks, labels=self.datas['Sentiment'][item]) 92 | cur_features = convert_example_to_features(cur_example, self.max_seq_length, self.tokenizer) 93 | cur_tensors = ( 94 | torch.LongTensor(cur_features.input_ids), 95 | torch.tensor(cur_features.label_ids) 96 | ) 97 | return cur_tensors 98 | 99 | 100 | class InputExample(object): 101 | def __init__(self, uid, toks, labels=None): 102 | self.toks = toks 103 | self.labels = labels 104 | self.uid = uid 105 | 106 | 107 | class InputFeatures(object): 108 | def __init__(self, eid, input_ids, label_ids): 109 | self.input_ids = input_ids 110 | self.label_ids = label_ids 111 | self.eid = eid 112 | 113 | 114 | def convert_example_to_features(example, max_seq_length, tokenizer): 115 | """Convert a raw sample (pair of sentences as tokenized strings) into a proper training sample as ids.""" 116 | input_ids = tokenizer.convert_tokens_to_ids(example.toks)[:max_seq_length] 117 | # pad up to the sequence length. 
118 | while len(input_ids) < max_seq_length: 119 | input_ids.append(0) 120 | 121 | if example.uid == 0: 122 | print("*** Example ***") 123 | print("uid: %s" % example.uid) 124 | print("tokens: %s" % " ".join([str(x) for x in example.toks])) 125 | print("input_ids: %s" % " ".join([str(x) for x in input_ids])) 126 | print("label: %s " % example.labels) 127 | 128 | features = InputFeatures(input_ids=input_ids, 129 | eid=example.uid, 130 | label_ids=example.labels, 131 | ) 132 | return features 133 | 134 | 135 | def load_word_vector_mapping(file): 136 | ret = OrderedDict() 137 | with open(file, encoding="utf8") as f: 138 | for line in f.readlines(): 139 | word, vec = line.split(" ", 1) 140 | ret[word] = list(map(float, vec.split())) 141 | return ret 142 | 143 | 144 | if __name__ == "__main__": 145 | # load data 146 | datas = pd.read_csv(os.path.join(data_path, 'train.tsv'), sep='\t') 147 | train_num = int(len(datas) * 0.8) 148 | train_data = datas[:train_num] 149 | dev_data = datas[train_num:] 150 | dev_data.index = range(len(dev_data)) 151 | test_data = pd.read_csv(os.path.join(data_path, 'test.tsv'), sep='\t') 152 | test_data["Sentiment"] = [0 for _ in range(len(test_data))] 153 | 154 | # build model 155 | if use_pretrained_embedding: 156 | word2vec = load_word_vector_mapping(embed_path) 157 | words = list(word2vec.keys()) 158 | tok2id = dict([(x, i) for i, x in enumerate(words, 2)]) 159 | tokenizer = Tokenizer(train_data['Phrase'], tok2id) 160 | vecs = list(word2vec.values()) 161 | assert embed_size == len( 162 | vecs[0]), "Parameter embed_size must be equal to the embed_size of the pretrained embeddings." 163 | vecs.insert(0, [.0 for _ in range(embed_size)]) # PAD 164 | vecs.insert(1, [.0 for _ in range(embed_size)]) # UNK 165 | 166 | vocab_size = len(tokenizer.tok2id) 167 | if use_rnn: 168 | model = RNN(vocab_size, embed_size, hidden_size, num_layers, num_classes, bidirectional, dropout_rate).to( 169 | device) 170 | else: 171 | model = CNN(vocab_size, embed_size, num_classes, num_filters, kernel_sizes, dropout_rate).to(device) 172 | weights = torch.tensor(vecs) 173 | model.embed.from_pretrained(weights, freeze=freeze) 174 | else: 175 | tokenizer = Tokenizer(train_data['Phrase']) 176 | vocab_size = len(tokenizer.tok2id) 177 | if use_rnn: 178 | model = RNN(vocab_size, embed_size, hidden_size, num_layers, num_classes, bidirectional, dropout_rate).to( 179 | device) 180 | else: 181 | model = CNN(vocab_size, embed_size, num_classes, num_filters, kernel_sizes, dropout_rate).to(device) 182 | 183 | # build datasets 184 | train_dataset = MyDataset(train_data, max_seq_length, tokenizer) 185 | dev_dataset = MyDataset(dev_data, max_seq_length, tokenizer) 186 | test_dataset = MyDataset(test_data, max_seq_length, tokenizer) 187 | 188 | train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=batch_size) 189 | dev_dataloader = DataLoader(dev_dataset, shuffle=True, batch_size=batch_size) 190 | test_dataloader = DataLoader(test_dataset, batch_size=batch_size) 191 | 192 | optimizer = Adam(model.parameters(), lr=learning_rate) 193 | loss_func = nn.CrossEntropyLoss() 194 | writer = SummaryWriter('logs', comment="rnn") 195 | for epoch in trange(train_epochs, desc="Epoch"): 196 | model.train() 197 | ep_loss = 0 198 | for step, batch in enumerate(tqdm(train_dataloader, desc="Iteration")): 199 | inputs, labels = tuple(t.to(device) for t in batch) 200 | lens = (inputs != 0).sum(-1) # 0 is the id of [PAD]. 
201 | outputs = model(inputs, lens) 202 | loss = loss_func(outputs, labels) 203 | ep_loss += loss.item() 204 | 205 | model.zero_grad() 206 | loss.backward() 207 | nn.utils.clip_grad_norm_(model.parameters(), clip) 208 | optimizer.step() 209 | 210 | writer.add_scalar('Train_Loss', loss, epoch) 211 | if step % 100 == 0: 212 | tqdm.write("Epoch %d, Step %d, Loss %.2f" % (epoch, step, loss.item())) 213 | 214 | # evaluating 215 | model.eval() 216 | with torch.no_grad(): 217 | corr_num = 0 218 | err_num = 0 219 | for batch in dev_dataloader: 220 | inputs, labels = tuple(t.to(device) for t in batch) 221 | lens = (inputs != 0).sum(-1) # 0 is the id of [PAD]. 222 | outputs = model(inputs, lens) 223 | corr_num += (outputs.argmax(1) == labels).sum().item() 224 | err_num += (outputs.argmax(1) != labels).sum().item() 225 | tqdm.write("Epoch %d, Accuracy %.3f" % (epoch, corr_num / (corr_num + err_num))) 226 | 227 | # predicting 228 | model.eval() 229 | with torch.no_grad(): 230 | predicts = [] 231 | for batch in test_dataloader: 232 | inputs, labels = tuple(t.to(device) for t in batch) 233 | lens = (inputs != 0).sum(-1) # 0 is the id of [PAD]. 234 | outputs = model(inputs, lens) 235 | predicts.extend(outputs.argmax(1).cpu().numpy()) 236 | test_data["Sentiment"] = predicts 237 | test_data[['PhraseId', 'Sentiment']].set_index('PhraseId').to_csv('result.csv') 238 | 239 | # 0.597 240 | -------------------------------------------------------------------------------- /Task2-Text Classification (RNN&CNN)/run_torchtext.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | from tqdm import tqdm, trange 4 | from torch.optim import Adam 5 | from tensorboardX import SummaryWriter 6 | import pandas as pd 7 | import os 8 | from torchtext import data 9 | from torchtext.data import Iterator, BucketIterator 10 | from torchtext.vocab import Vectors 11 | 12 | from models import RNN, CNN 13 | 14 | device = torch.device("cuda" if torch.cuda.is_available() else "cpu") 15 | 16 | train_epochs = 5 17 | batch_size = 512 18 | learning_rate = 0.001 19 | max_seq_length = 48 20 | num_classes = 5 21 | dropout_rate = 0.1 22 | data_path = "data" 23 | clip = 5 24 | 25 | # embedding 26 | embed_size = 200 27 | # vectors = None 28 | vectors = Vectors('glove.6B.200d.txt', '/home/yjc/embeddings') 29 | freeze = False 30 | 31 | use_rnn = True 32 | # parameters for RNN 33 | hidden_size = 256 34 | num_layers = 1 35 | bidirectional = True 36 | 37 | # parameters for CNN 38 | num_filters = 200 39 | kernel_sizes = [2, 3, 4] # n-gram 40 | 41 | 42 | def load_iters(batch_size=32, device="cpu", data_path='data', vectors=None): 43 | TEXT = data.Field(lower=True, batch_first=True, include_lengths=True) 44 | LABEL = data.LabelField(batch_first=True) 45 | train_fields = [(None, None), (None, None), ('text', TEXT), ('label', LABEL)] 46 | test_fields = [(None, None), (None, None), ('text', TEXT)] 47 | 48 | train_data = data.TabularDataset.splits( 49 | path=data_path, 50 | train='train.tsv', 51 | format='tsv', 52 | fields=train_fields, 53 | skip_header=True 54 | )[0] # return is a tuple. 
55 | 56 | test_data = data.TabularDataset.splits( 57 | path='data', 58 | train='test.tsv', 59 | format='tsv', 60 | fields=test_fields, 61 | skip_header=True 62 | )[0] 63 | 64 | TEXT.build_vocab(train_data.text, vectors=vectors) 65 | LABEL.build_vocab(train_data.label) 66 | train_data, dev_data = train_data.split([0.8, 0.2]) 67 | 68 | train_iter, dev_iter = BucketIterator.splits( 69 | (train_data, dev_data), 70 | batch_sizes=(batch_size, batch_size), 71 | device=device, 72 | sort_key=lambda x: len(x.text), 73 | sort_within_batch=True, 74 | repeat=False, 75 | shuffle=True 76 | ) 77 | 78 | test_iter = Iterator(test_data, batch_size=batch_size, device=device, sort=False, sort_within_batch=False, 79 | repeat=False, shuffle=False) 80 | return train_iter, dev_iter, test_iter, TEXT, LABEL 81 | 82 | 83 | if __name__ == "__main__": 84 | train_iter, dev_iter, test_iter, TEXT, LABEL = load_iters(batch_size, device, data_path, vectors) 85 | vocab_size = len(TEXT.vocab.itos) 86 | # build model 87 | if use_rnn: 88 | model = RNN(vocab_size, embed_size, hidden_size, num_layers, num_classes, bidirectional, dropout_rate) 89 | else: 90 | model = CNN(vocab_size, embed_size, num_classes, num_filters, kernel_sizes, dropout_rate) 91 | if vectors is not None: 92 | model.embed.from_pretrained(TEXT.vocab.vectors, freeze=freeze) 93 | model.to(device) 94 | 95 | optimizer = Adam(model.parameters(), lr=learning_rate) 96 | loss_func = nn.CrossEntropyLoss() 97 | writer = SummaryWriter('logs', comment="rnn") 98 | for epoch in trange(train_epochs, desc="Epoch"): 99 | model.train() 100 | ep_loss = 0 101 | for step, batch in enumerate(tqdm(train_iter, desc="Iteration")): 102 | (inputs, lens), labels = batch.text, batch.label 103 | outputs = model(inputs, lens) 104 | loss = loss_func(outputs, labels) 105 | ep_loss += loss.item() 106 | 107 | model.zero_grad() 108 | loss.backward() 109 | nn.utils.clip_grad_norm_(model.parameters(), clip) 110 | optimizer.step() 111 | 112 | writer.add_scalar('Train_Loss', loss, epoch) 113 | if step % 100 == 0: 114 | tqdm.write("Epoch %d, Step %d, Loss %.2f" % (epoch, step, loss.item())) 115 | 116 | # evaluating 117 | model.eval() 118 | with torch.no_grad(): 119 | corr_num = 0 120 | err_num = 0 121 | for batch in dev_iter: 122 | (inputs, lens), labels = batch.text, batch.label 123 | outputs = model(inputs, lens) 124 | corr_num += (outputs.argmax(1) == labels).sum().item() 125 | err_num += (outputs.argmax(1) != labels).sum().item() 126 | tqdm.write("Epoch %d, Accuracy %.3f" % (epoch, corr_num / (corr_num + err_num))) 127 | 128 | # predicting 129 | model.eval() 130 | with torch.no_grad(): 131 | predicts = [] 132 | for batch in test_iter: 133 | inputs, lens = batch.text 134 | outputs = model(inputs, lens) 135 | predicts.extend(outputs.argmax(1).cpu().numpy()) 136 | test_data = pd.read_csv(os.path.join(data_path, 'test.tsv'), sep='\t') 137 | test_data["Sentiment"] = predicts 138 | test_data[['PhraseId', 'Sentiment']].set_index('PhraseId').to_csv('result.csv') 139 | -------------------------------------------------------------------------------- /Task3-Natural Language Inference/models.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | import torch.nn.functional as F 4 | 5 | 6 | class BiLSTM(nn.Module): 7 | def __init__(self, input_size, hidden_size=128, dropout_rate=0.1, layer_num=1): 8 | super(BiLSTM, self).__init__() 9 | self.hidden_size = hidden_size 10 | if layer_num == 1: 11 | self.bilstm = nn.LSTM(input_size, 
hidden_size // 2, layer_num, batch_first=True, bidirectional=True) 12 | 13 | else: 14 | self.bilstm = nn.LSTM(input_size, hidden_size // 2, layer_num, batch_first=True, dropout=dropout_rate, 15 | bidirectional=True) 16 | self.init_weights() 17 | 18 | def init_weights(self): 19 | for p in self.bilstm.parameters(): 20 | if p.dim() > 1: 21 | nn.init.normal_(p) 22 | p.data.mul_(0.01) 23 | else: 24 | p.data.zero_() 25 | # This is the range of indices for our forget gates for each LSTM cell 26 | p.data[self.hidden_size // 2: self.hidden_size] = 1 27 | 28 | def forward(self, x, lens): 29 | ''' 30 | :param x: (batch, seq_len, input_size) 31 | :param lens: (batch, ) 32 | :return: (batch, seq_len, hidden_size) 33 | ''' 34 | ordered_lens, index = lens.sort(descending=True) 35 | ordered_x = x[index] 36 | 37 | packed_x = nn.utils.rnn.pack_padded_sequence(ordered_x, ordered_lens, batch_first=True) 38 | packed_output, _ = self.bilstm(packed_x) 39 | output, _ = nn.utils.rnn.pad_packed_sequence(packed_output, batch_first=True) 40 | 41 | recover_index = index.argsort() 42 | recover_output = output[recover_index] 43 | return recover_output 44 | 45 | 46 | class ESIM(nn.Module): 47 | def __init__(self, vocab_size, num_labels, embed_size, hidden_size, dropout_rate=0.1, layer_num=1, 48 | pretrained_embed=None, freeze=False): 49 | super(ESIM, self).__init__() 50 | self.pretrained_embed = pretrained_embed 51 | if pretrained_embed is not None: 52 | self.embed = nn.Embedding.from_pretrained(pretrained_embed, freeze) 53 | else: 54 | self.embed = nn.Embedding(vocab_size, embed_size) 55 | self.bilstm1 = BiLSTM(embed_size, hidden_size, dropout_rate, layer_num) 56 | self.bilstm2 = BiLSTM(hidden_size, hidden_size, dropout_rate, layer_num) 57 | self.fc1 = nn.Linear(4 * hidden_size, hidden_size) 58 | self.fc2 = nn.Linear(4 * hidden_size, hidden_size) 59 | self.fc3 = nn.Linear(hidden_size, num_labels) 60 | self.dropout = nn.Dropout(dropout_rate) 61 | 62 | self.init_weight() 63 | 64 | def init_weight(self): 65 | if self.pretrained_embed is None: 66 | nn.init.normal_(self.embed.weight) 67 | self.embed.weight.data.mul_(0.01) 68 | nn.init.normal_(self.fc1.weight) 69 | self.fc1.weight.data.mul_(0.01) 70 | nn.init.normal_(self.fc2.weight) 71 | self.fc2.weight.data.mul_(0.01) 72 | nn.init.normal_(self.fc3.weight) 73 | self.fc3.weight.data.mul_(0.01) 74 | 75 | 76 | def soft_align_attention(self, x1, x1_lens, x2, x2_lens): 77 | ''' 78 | local inference modeling 79 | :param x1: (batch, seq1_len, hidden_size) 80 | :param x1_lens: (batch, ) 81 | :param x2: (batch, seq2_len, hidden_size) 82 | :param x2_lens: (batch, ) 83 | :return: x1_align (batch, seq1_len, hidden_size) 84 | x2_align (batch, seq2_len, hidden_size) 85 | ''' 86 | seq1_len = x1.size(1) 87 | seq2_len = x2.size(1) 88 | batch_size = x1.size(0) 89 | 90 | attention = torch.matmul(x1, x2.transpose(1, 2)) # (batch, seq1_len, seq2_len) 91 | mask1 = torch.arange(seq1_len).expand(batch_size, seq1_len).to(x1.device) >= x1_lens.unsqueeze( 92 | 1) # (batch, seq1_len), 1 means 93 | mask2 = torch.arange(seq2_len).expand(batch_size, seq2_len).to(x1.device) >= x2_lens.unsqueeze( 94 | 1) # (batch, seq2_len) 95 | mask1 = mask1.float().masked_fill_(mask1, float('-inf')) 96 | mask2 = mask2.float().masked_fill_(mask2, float('-inf')) 97 | weight2 = F.softmax(attention + mask2.unsqueeze(1), dim=-1) # (batch, seq1_len, seq2_len) 98 | x1_align = torch.matmul(weight2, x2) # (batch, seq1_len, hidden_size) 99 | weight1 = F.softmax(attention.transpose(1, 2) + mask1.unsqueeze(1), dim=-1) # (batch, 
seq2_len, seq1_len) 100 | x2_align = torch.matmul(weight1, x1) # (batch, seq2_len, hidden_size) 101 | return x1_align, x2_align 102 | 103 | def composition(self, x, lens): 104 | x = F.relu(self.fc1(x)) 105 | x_compose = self.bilstm2(self.dropout(x), lens) # (batch, seq_len, hidden_size) 106 | p1 = F.avg_pool1d(x_compose.transpose(1, 2), x.size(1)).squeeze(-1) # (batch, hidden_size) 107 | p2 = F.max_pool1d(x_compose.transpose(1, 2), x.size(1)).squeeze(-1) # (batch, hidden_size) 108 | return torch.cat([p1, p2], 1) # (batch, hidden_size*2) 109 | 110 | def forward(self, x1, x1_lens, x2, x2_lens): 111 | ''' 112 | :param x1: (batch, seq1_len) 113 | :param x1_lens: (batch,) 114 | :param x2: (batch, seq2_len) 115 | :param x2_lens: (batch,) 116 | :return: (batch, num_class) 117 | ''' 118 | # Input encoding 119 | embed1 = self.embed(x1) # (batch, seq1_len, embed_size) 120 | embed2 = self.embed(x2) # (batch, seq2_len, embed_size) 121 | new_embed1 = self.bilstm1(self.dropout(embed1), x1_lens) # (batch, seq1_len, hidden_size) 122 | new_embed2 = self.bilstm1(self.dropout(embed2), x2_lens) # (batch, seq2_len, hidden_size) 123 | 124 | # Local inference collected over sequence 125 | x1_align, x2_align = self.soft_align_attention(new_embed1, x1_lens, new_embed2, x2_lens) 126 | 127 | # Enhancement of local inference information 128 | x1_combined = torch.cat([new_embed1, x1_align, new_embed1 - x1_align, new_embed1 * x1_align], 129 | dim=-1) # (batch, seq1_len, 4*hidden_size) 130 | x2_combined = torch.cat([new_embed2, x2_align, new_embed2 - x2_align, new_embed2 * x2_align], 131 | dim=-1) # (batch, seq2_len, 4*hidden_size) 132 | 133 | # Inference composition 134 | x1_composed = self.composition(x1_combined, x1_lens) # (batch, 2*hidden_size), v=[v_avg; v_max] 135 | x2_composed = self.composition(x2_combined, x2_lens) # (batch, 2*hidden_size) 136 | composed = torch.cat([x1_composed, x2_composed], -1) # (batch, 4*hidden_size) 137 | 138 | # MLP classifier 139 | out = self.fc3(self.dropout(torch.tanh(self.fc2(self.dropout(composed))))) 140 | return out 141 | -------------------------------------------------------------------------------- /Task3-Natural Language Inference/run.py: -------------------------------------------------------------------------------- 1 | from util import load_iters 2 | import torch 3 | import torch.nn as nn 4 | import torch.optim as optim 5 | from torchtext.vocab import Vectors 6 | from models import ESIM 7 | from tqdm import tqdm 8 | torch.manual_seed(1) 9 | device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') 10 | 11 | BATCH_SIZE = 32 12 | HIDDEN_SIZE = 600 # every LSTM's(forward and backward) hidden size is half of HIDDEN_SIZE 13 | EPOCHS = 20 14 | DROPOUT_RATE = 0.5 15 | LAYER_NUM = 1 16 | LEARNING_RATE = 4e-4 17 | PATIENCE = 5 18 | CLIP = 10 19 | EMBEDDING_SIZE = 300 20 | # vectors = None 21 | vectors = Vectors('glove.840B.300d.txt', '/home/yjc/embeddings') 22 | freeze = False 23 | data_path = 'data' 24 | 25 | def show_example(premise, hypothesis, label, TEXT, LABEL): 26 | tqdm.write('Label: ' + LABEL.vocab.itos[label]) 27 | tqdm.write('premise: ' + ' '.join([TEXT.vocab.itos[i] for i in premise])) 28 | tqdm.write('hypothesis: ' + ' '.join([TEXT.vocab.itos[i] for i in hypothesis])) 29 | 30 | 31 | def count_parameters(model): 32 | return sum(p.numel() for p in model.parameters() if p.requires_grad) 33 | 34 | 35 | def eval(data_iter, name, epoch=None, use_cache=False): 36 | if use_cache: 37 | model.load_state_dict(torch.load('best_model.ckpt')) 38 | model.eval() 39 | 
    correct_num = 0
40 |     err_num = 0
41 |     total_loss = 0
42 |     with torch.no_grad():
43 |         for i, batch in enumerate(data_iter):
44 |             premise, premise_lens = batch.premise
45 |             hypothesis, hypothesis_lens = batch.hypothesis
46 |             labels = batch.label
47 | 
48 |             output = model(premise, premise_lens, hypothesis, hypothesis_lens)
49 |             predicts = output.argmax(-1).reshape(-1)
50 |             loss = loss_func(output, labels)
51 |             total_loss += loss.item()
52 |             correct_num += (predicts == labels).sum().item()
53 |             err_num += (predicts != batch.label).sum().item()
54 | 
55 |     acc = correct_num / (correct_num + err_num)
56 |     if epoch is not None:
57 |         tqdm.write(
58 |             "Epoch: %d, %s Acc: %.3f, Loss %.3f" % (epoch + 1, name, acc, total_loss))
59 |     else:
60 |         tqdm.write(
61 |             "%s Acc: %.3f, Loss %.3f" % (name, acc, total_loss))
62 |     return acc
63 | 
64 | def train(train_iter, dev_iter, loss_func, optimizer, epochs, patience=5, clip=5):
65 |     best_acc = -1
66 |     patience_counter = 0
67 |     for epoch in range(epochs):
68 |         model.train()
69 |         total_loss = 0
70 |         for batch in tqdm(train_iter):
71 |             premise, premise_lens = batch.premise
72 |             hypothesis, hypothesis_lens = batch.hypothesis
73 |             labels = batch.label
74 |             # show_example(premise[0],hypothesis[0], labels[0], TEXT, LABEL)
75 | 
76 |             model.zero_grad()
77 |             output = model(premise, premise_lens, hypothesis, hypothesis_lens)
78 |             loss = loss_func(output, labels)
79 |             total_loss += loss.item()
80 |             loss.backward()
81 |             torch.nn.utils.clip_grad_norm_(model.parameters(), clip)
82 |             optimizer.step()
83 |         tqdm.write("Epoch: %d, Train Loss: %d" % (epoch + 1, total_loss))
84 | 
85 |         acc = eval(dev_iter, "Dev", epoch)
86 |         if acc <= best_acc:
87 |             patience_counter += 1
88 |         else:
89 |             best_acc = acc
90 |             patience_counter = 0
91 |             torch.save(model.state_dict(), 'best_model.ckpt')  # keep the best model for later evaluation
92 |         if patience_counter >= patience:
93 |             tqdm.write("Early stopping: patience limit reached, stopping...")
94 |             break
95 | 
96 | if __name__ == "__main__":
97 |     train_iter, dev_iter, test_iter, TEXT, LABEL, _ = load_iters(BATCH_SIZE, device, data_path, vectors)
98 | 
99 |     model = ESIM(len(TEXT.vocab), len(LABEL.vocab.stoi),
100 |                  EMBEDDING_SIZE, HIDDEN_SIZE, DROPOUT_RATE, LAYER_NUM,
101 |                  TEXT.vocab.vectors, freeze).to(device)
102 |     print(f'The model has {count_parameters(model):,} trainable parameters')
103 | 
104 |     optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)
105 |     loss_func = nn.CrossEntropyLoss()
106 | 
107 |     train(train_iter, dev_iter, loss_func, optimizer, EPOCHS, PATIENCE, CLIP)
108 |     eval(test_iter, "Test", use_cache=True)
109 | 
--------------------------------------------------------------------------------
/Task3-Natural Language Inference/util.py:
--------------------------------------------------------------------------------
1 | # -*- coding:utf8 -*-
2 | from torchtext.data import Iterator, BucketIterator
3 | from torchtext import data
4 | import torch
5 | 
6 | def load_iters(batch_size=32, device="cpu", data_path='data', vectors=None, use_tree=False):
7 |     if not use_tree:
8 |         TEXT = data.Field(batch_first=True, include_lengths=True, lower=True)
9 |         LABEL = data.LabelField(batch_first=True)
10 |         TREE = None
11 | 
12 |         fields = {'sentence1': ('premise', TEXT),
13 |                   'sentence2': ('hypothesis', TEXT),
14 |                   'gold_label': ('label', LABEL)}
15 |     else:
16 |         TEXT = data.Field(batch_first=True,
17 |                           lower=True,
18 |                           preprocessing=lambda parse: [t for t in parse if t not in ('(', ')')],
19 |                           include_lengths=True)
20 |         LABEL = data.LabelField(batch_first=True)
21 |         TREE = data.Field(preprocessing=lambda parse: ['reduce' if t == ')' else 'shift' for t in parse if t != '('],
22 |                           batch_first=True)
23 | 
24 |         TREE.build_vocab([['reduce'], ['shift']])
25 | 
26 |         fields =
{'sentence1_binary_parse': [('premise', TEXT), 27 | ('premise_transitions', TREE)], 28 | 'sentence2_binary_parse': [('hypothesis', TEXT), 29 | ('hypothesis_transitions', TREE)], 30 | 'gold_label': ('label', LABEL)} 31 | 32 | train_data, dev_data, test_data = data.TabularDataset.splits( 33 | path=data_path, 34 | train='snli_1.0_train.jsonl', 35 | validation='snli_1.0_dev.jsonl', 36 | test='snli_1.0_test.jsonl', 37 | format='json', 38 | fields=fields, 39 | filter_pred=lambda ex: ex.label != '-' # filter the example which label is '-'(means unlabeled) 40 | ) 41 | if vectors is not None: 42 | TEXT.build_vocab(train_data, vectors=vectors, unk_init=torch.Tensor.normal_) 43 | else: 44 | TEXT.build_vocab(train_data) 45 | LABEL.build_vocab(dev_data) 46 | 47 | train_iter, dev_iter = BucketIterator.splits( 48 | (train_data, dev_data), 49 | batch_sizes=(batch_size, batch_size), 50 | device=device, 51 | sort_key=lambda x: len(x.premise) + len(x.hypothesis), 52 | sort_within_batch=True, 53 | repeat=False, 54 | shuffle=True 55 | ) 56 | 57 | test_iter = Iterator(test_data, 58 | batch_size=batch_size, 59 | device=device, 60 | sort=False, 61 | sort_within_batch=False, 62 | repeat=False, 63 | shuffle=False) 64 | 65 | return train_iter, dev_iter, test_iter, TEXT, LABEL, TREE 66 | 67 | -------------------------------------------------------------------------------- /Task4-Named Entity Recognization/bio2bioes.py: -------------------------------------------------------------------------------- 1 | import argparse 2 | 3 | parser = argparse.ArgumentParser(description='Change encoding from BIO to BIOES') 4 | parser.add_argument('input', metavar='-i', type=str, help='The path to the original file with BIO encoding') 5 | parser.add_argument('output', metavar='-o', type=str, help='The name of your BIOES encoded file') 6 | args = parser.parse_args() 7 | 8 | input_file = args.input 9 | output_file = args.output 10 | 11 | 12 | def read_file(input_file): 13 | with open(input_file, 'r', encoding="utf8") as f: 14 | return f.read().split('\n')[:-1] 15 | 16 | 17 | def write_line(new_label, prev_label, line_content, output_file): 18 | new_iob = new_label + prev_label 19 | line_content[-1] = new_iob 20 | current_line = ' '.join(line_content) 21 | output_file.write(current_line + '\n') 22 | 23 | 24 | def not_same_tag(tag1, tag2): 25 | return tag1.split("-")[-1] != tag2.split("-")[-1] 26 | 27 | 28 | def convert(input_file, output_path): 29 | output_file = open(output_path, 'w', encoding="utf8") 30 | 31 | for i in range(len(input_file)): 32 | 33 | try: 34 | current_line = input_file[i] 35 | 36 | if '-DOCSTART-' in current_line: 37 | output_file.write(current_line + "\n") 38 | elif len(current_line.strip()) == 0: 39 | output_file.write(current_line.strip() + "\n") 40 | output_file.flush() 41 | else: 42 | output_file.flush() 43 | if i == 280: 44 | print(current_line) 45 | prev_iob = "" 46 | next_iob = "" 47 | prev_line = None 48 | next_line = None 49 | 50 | try: 51 | prev_line = input_file[i - 1] 52 | next_line = input_file[i + 1] 53 | 54 | if len(prev_line.strip()) > 0: 55 | prev_line_content = prev_line.split() 56 | prev_iob = prev_line_content[-1] 57 | 58 | if len(next_line.strip()) > 0: 59 | next_line_content = next_line.split() 60 | next_iob = next_line_content[-1] 61 | 62 | except IndexError: 63 | pass 64 | 65 | current_line_content = current_line.strip().split() 66 | current_iob = current_line_content[-1] 67 | 68 | # Outside entities 69 | if current_iob == 'O': 70 | output_file.write(current_line + "\n") 71 | 72 | # Unit 
length entities 73 | elif current_iob.startswith("B-") and \ 74 | (next_iob == 'O' or len(next_line.strip()) == 0 or next_iob.startswith("B-")): 75 | write_line('S-', current_iob[2:], current_line_content, output_file) 76 | 77 | # First element of chunk 78 | elif current_iob.startswith("B-") and \ 79 | (not not_same_tag(current_iob, next_iob) and next_iob.startswith("I-")): 80 | write_line('B-', current_iob[2:], current_line_content, output_file) 81 | 82 | # Last element of chunk 83 | elif current_iob.startswith("I-") and \ 84 | (next_iob == 'O' or len(next_line.strip()) == 0 or next_iob.startswith("B-")): 85 | write_line('E-', current_iob[2:], current_line_content, output_file) 86 | 87 | # Inside a chunk 88 | elif current_iob.startswith("I-") and \ 89 | next_iob.startswith("I-"): 90 | write_line('I-', current_iob[2:], current_line_content, output_file) 91 | 92 | except IndexError: 93 | pass 94 | 95 | 96 | bio = read_file(input_file) 97 | convert(bio, output_file) 98 | -------------------------------------------------------------------------------- /Task4-Named Entity Recognization/conlleval.pl: -------------------------------------------------------------------------------- 1 | #!/usr/bin/perl -w 2 | # conlleval: evaluate result of processing CoNLL-2000 shared task 3 | # usage: conlleval [-l] [-r] [-d delimiterTag] [-o oTag] < file 4 | # README: http://cnts.uia.ac.be/conll2000/chunking/output.html 5 | # options: l: generate LaTeX output for tables like in 6 | # http://cnts.uia.ac.be/conll2003/ner/example.tex 7 | # r: accept raw result tags (without B- and I- prefix; 8 | # assumes one word per chunk) 9 | # d: alternative delimiter tag (default is single space) 10 | # o: alternative outside tag (default is O) 11 | # note: the file should contain lines with items separated 12 | # by $delimiter characters (default space). The final 13 | # two items should contain the correct tag and the 14 | # guessed tag in that order. Sentences should be 15 | # separated from each other by empty lines or lines 16 | # with $boundary fields (default -X-). 17 | # url: http://lcg-www.uia.ac.be/conll2000/chunking/ 18 | # started: 1998-09-25 19 | # version: 2004-01-26 20 | # author: Erik Tjong Kim Sang 21 | 22 | use strict; 23 | 24 | my $false = 0; 25 | my $true = 42; 26 | 27 | my $boundary = "-X-"; # sentence boundary 28 | my $correct; # current corpus chunk tag (I,O,B) 29 | my $correctChunk = 0; # number of correctly identified chunks 30 | my $correctTags = 0; # number of correct chunk tags 31 | my $correctType; # type of current corpus chunk tag (NP,VP,etc.) 
32 | my $delimiter = " "; # field delimiter
33 | my $FB1 = 0.0; # FB1 score (Van Rijsbergen 1979)
34 | my $firstItem; # first feature (for sentence boundary checks)
35 | my $foundCorrect = 0; # number of chunks in corpus
36 | my $foundGuessed = 0; # number of identified chunks
37 | my $guessed; # current guessed chunk tag
38 | my $guessedType; # type of current guessed chunk tag
39 | my $i; # miscellaneous counter
40 | my $inCorrect = $false; # currently processed chunk is correct until now
41 | my $lastCorrect = "O"; # previous chunk tag in corpus
42 | my $latex = 0; # generate LaTeX formatted output
43 | my $lastCorrectType = ""; # type of previously identified chunk tag
44 | my $lastGuessed = "O"; # previously identified chunk tag
45 | my $lastGuessedType = ""; # type of previous chunk tag in corpus
46 | my $lastType; # temporary storage for detecting duplicates
47 | my $line; # line
48 | my $nbrOfFeatures = -1; # number of features per line
49 | my $precision = 0.0; # precision score
50 | my $oTag = "O"; # outside tag, default O
51 | my $raw = 0; # raw input: add B to every token
52 | my $recall = 0.0; # recall score
53 | my $tokenCounter = 0; # token counter (ignores sentence breaks)
54 | 
55 | my %correctChunk = (); # number of correctly identified chunks per type
56 | my %foundCorrect = (); # number of chunks in corpus per type
57 | my %foundGuessed = (); # number of identified chunks per type
58 | 
59 | my @features; # features on line
60 | my @sortedTypes; # sorted list of chunk type names
61 | 
62 | # sanity check
63 | while (@ARGV and $ARGV[0] =~ /^-/) {
64 | if ($ARGV[0] eq "-l") { $latex = 1; shift(@ARGV); }
65 | elsif ($ARGV[0] eq "-r") { $raw = 1; shift(@ARGV); }
66 | elsif ($ARGV[0] eq "-d") {
67 | shift(@ARGV);
68 | if (not defined $ARGV[0]) {
69 | die "conlleval: -d requires delimiter character";
70 | }
71 | $delimiter = shift(@ARGV);
72 | } elsif ($ARGV[0] eq "-o") {
73 | shift(@ARGV);
74 | if (not defined $ARGV[0]) {
75 | die "conlleval: -o requires delimiter character";
76 | }
77 | $oTag = shift(@ARGV);
78 | } else { die "conlleval: unknown argument $ARGV[0]\n"; }
79 | }
80 | if (@ARGV) { die "conlleval: unexpected command line argument\n"; }
81 | # process input
82 | while (<STDIN>) {
83 | chomp($line = $_);
84 | @features = split(/$delimiter/,$line);
85 | if ($nbrOfFeatures < 0) { $nbrOfFeatures = $#features; }
86 | elsif ($nbrOfFeatures != $#features and @features != 0) {
87 | printf STDERR "unexpected number of features: %d (%d)\n",
88 | $#features+1,$nbrOfFeatures+1;
89 | exit(1);
90 | }
91 | if (@features == 0 or
92 | $features[0] eq $boundary) { @features = ($boundary,"O","O"); }
93 | if (@features < 2) {
94 | die "conlleval: unexpected number of features in line $line\n";
95 | }
96 | if ($raw) {
97 | if ($features[$#features] eq $oTag) { $features[$#features] = "O"; }
98 | if ($features[$#features-1] eq $oTag) { $features[$#features-1] = "O"; }
99 | if ($features[$#features] ne "O") {
100 | $features[$#features] = "B-$features[$#features]";
101 | }
102 | if ($features[$#features-1] ne "O") {
103 | $features[$#features-1] = "B-$features[$#features-1]";
104 | }
105 | }
106 | # 20040126 ET code which allows hyphens in the types
107 | if ($features[$#features] =~ /^([^-]*)-(.*)$/) {
108 | $guessed = $1;
109 | $guessedType = $2;
110 | } else {
111 | $guessed = $features[$#features];
112 | $guessedType = "";
113 | }
114 | pop(@features);
115 | if ($features[$#features] =~ /^([^-]*)-(.*)$/) {
116 | $correct = $1;
117 | $correctType = $2;
118 | } else {
119 | $correct = 
$features[$#features]; 120 | $correctType = ""; 121 | } 122 | pop(@features); 123 | # ($guessed,$guessedType) = split(/-/,pop(@features)); 124 | # ($correct,$correctType) = split(/-/,pop(@features)); 125 | $guessedType = $guessedType ? $guessedType : ""; 126 | $correctType = $correctType ? $correctType : ""; 127 | $firstItem = shift(@features); 128 | 129 | # 1999-06-26 sentence breaks should always be counted as out of chunk 130 | if ( $firstItem eq $boundary ) { $guessed = "O"; } 131 | 132 | if ($inCorrect) { 133 | if ( &endOfChunk($lastCorrect,$correct,$lastCorrectType,$correctType) and 134 | &endOfChunk($lastGuessed,$guessed,$lastGuessedType,$guessedType) and 135 | $lastGuessedType eq $lastCorrectType) { 136 | $inCorrect=$false; 137 | $correctChunk++; 138 | $correctChunk{$lastCorrectType} = $correctChunk{$lastCorrectType} ? 139 | $correctChunk{$lastCorrectType}+1 : 1; 140 | } elsif ( 141 | &endOfChunk($lastCorrect,$correct,$lastCorrectType,$correctType) != 142 | &endOfChunk($lastGuessed,$guessed,$lastGuessedType,$guessedType) or 143 | $guessedType ne $correctType ) { 144 | $inCorrect=$false; 145 | } 146 | } 147 | 148 | if ( &startOfChunk($lastCorrect,$correct,$lastCorrectType,$correctType) and 149 | &startOfChunk($lastGuessed,$guessed,$lastGuessedType,$guessedType) and 150 | $guessedType eq $correctType) { $inCorrect = $true; } 151 | 152 | if ( &startOfChunk($lastCorrect,$correct,$lastCorrectType,$correctType) ) { 153 | $foundCorrect++; 154 | $foundCorrect{$correctType} = $foundCorrect{$correctType} ? 155 | $foundCorrect{$correctType}+1 : 1; 156 | } 157 | if ( &startOfChunk($lastGuessed,$guessed,$lastGuessedType,$guessedType) ) { 158 | $foundGuessed++; 159 | $foundGuessed{$guessedType} = $foundGuessed{$guessedType} ? 160 | $foundGuessed{$guessedType}+1 : 1; 161 | } 162 | if ( $firstItem ne $boundary ) { 163 | if ( $correct eq $guessed and $guessedType eq $correctType ) { 164 | $correctTags++; 165 | } 166 | $tokenCounter++; 167 | } 168 | 169 | $lastGuessed = $guessed; 170 | $lastCorrect = $correct; 171 | $lastGuessedType = $guessedType; 172 | $lastCorrectType = $correctType; 173 | } 174 | if ($inCorrect) { 175 | $correctChunk++; 176 | $correctChunk{$lastCorrectType} = $correctChunk{$lastCorrectType} ? 177 | $correctChunk{$lastCorrectType}+1 : 1; 178 | } 179 | 180 | if (not $latex) { 181 | # compute overall precision, recall and FB1 (default values are 0.0) 182 | $precision = 100*$correctChunk/$foundGuessed if ($foundGuessed > 0); 183 | $recall = 100*$correctChunk/$foundCorrect if ($foundCorrect > 0); 184 | $FB1 = 2*$precision*$recall/($precision+$recall) 185 | if ($precision+$recall > 0); 186 | 187 | # print overall performance 188 | printf "processed $tokenCounter tokens with $foundCorrect phrases; "; 189 | printf "found: $foundGuessed phrases; correct: $correctChunk.\n"; 190 | if ($tokenCounter>0) { 191 | printf "accuracy: %6.2f%%; ",100*$correctTags/$tokenCounter; 192 | printf "precision: %6.2f%%; ",$precision; 193 | printf "recall: %6.2f%%; ",$recall; 194 | printf "FB1: %6.2f\n",$FB1; 195 | } 196 | } 197 | 198 | # sort chunk type names 199 | undef($lastType); 200 | @sortedTypes = (); 201 | foreach $i (sort (keys %foundCorrect,keys %foundGuessed)) { 202 | if (not($lastType) or $lastType ne $i) { 203 | push(@sortedTypes,($i)); 204 | } 205 | $lastType = $i; 206 | } 207 | # print performance per chunk type 208 | if (not $latex) { 209 | for $i (@sortedTypes) { 210 | $correctChunk{$i} = $correctChunk{$i} ? 
$correctChunk{$i} : 0; 211 | if (not($foundGuessed{$i})) { $foundGuessed{$i} = 0; $precision = 0.0; } 212 | else { $precision = 100*$correctChunk{$i}/$foundGuessed{$i}; } 213 | if (not($foundCorrect{$i})) { $recall = 0.0; } 214 | else { $recall = 100*$correctChunk{$i}/$foundCorrect{$i}; } 215 | if ($precision+$recall == 0.0) { $FB1 = 0.0; } 216 | else { $FB1 = 2*$precision*$recall/($precision+$recall); } 217 | printf "%17s: ",$i; 218 | printf "precision: %6.2f%%; ",$precision; 219 | printf "recall: %6.2f%%; ",$recall; 220 | printf "FB1: %6.2f %d\n",$FB1,$foundGuessed{$i}; 221 | } 222 | } else { 223 | print " & Precision & Recall & F\$_{\\beta=1} \\\\\\hline"; 224 | for $i (@sortedTypes) { 225 | $correctChunk{$i} = $correctChunk{$i} ? $correctChunk{$i} : 0; 226 | if (not($foundGuessed{$i})) { $precision = 0.0; } 227 | else { $precision = 100*$correctChunk{$i}/$foundGuessed{$i}; } 228 | if (not($foundCorrect{$i})) { $recall = 0.0; } 229 | else { $recall = 100*$correctChunk{$i}/$foundCorrect{$i}; } 230 | if ($precision+$recall == 0.0) { $FB1 = 0.0; } 231 | else { $FB1 = 2*$precision*$recall/($precision+$recall); } 232 | printf "\n%-7s & %6.2f\\%% & %6.2f\\%% & %6.2f \\\\", 233 | $i,$precision,$recall,$FB1; 234 | } 235 | print "\\hline\n"; 236 | $precision = 0.0; 237 | $recall = 0; 238 | $FB1 = 0.0; 239 | $precision = 100*$correctChunk/$foundGuessed if ($foundGuessed > 0); 240 | $recall = 100*$correctChunk/$foundCorrect if ($foundCorrect > 0); 241 | $FB1 = 2*$precision*$recall/($precision+$recall) 242 | if ($precision+$recall > 0); 243 | printf "Overall & %6.2f\\%% & %6.2f\\%% & %6.2f \\\\\\hline\n", 244 | $precision,$recall,$FB1; 245 | } 246 | 247 | exit 0; 248 | 249 | # endOfChunk: checks if a chunk ended between the previous and current word 250 | # arguments: previous and current chunk tags, previous and current types 251 | # note: this code is capable of handling other chunk representations 252 | # than the default CoNLL-2000 ones, see EACL'99 paper of Tjong 253 | # Kim Sang and Veenstra http://xxx.lanl.gov/abs/cs.CL/9907006 254 | 255 | sub endOfChunk { 256 | my $prevTag = shift(@_); 257 | my $tag = shift(@_); 258 | my $prevType = shift(@_); 259 | my $type = shift(@_); 260 | my $chunkEnd = $false; 261 | 262 | if ( $prevTag eq "B" and $tag eq "B" ) { $chunkEnd = $true; } 263 | if ( $prevTag eq "B" and $tag eq "O" ) { $chunkEnd = $true; } 264 | if ( $prevTag eq "B" and $tag eq "S" ) { $chunkEnd = $true; } 265 | 266 | if ( $prevTag eq "I" and $tag eq "B" ) { $chunkEnd = $true; } 267 | if ( $prevTag eq "I" and $tag eq "S" ) { $chunkEnd = $true; } 268 | if ( $prevTag eq "I" and $tag eq "O" ) { $chunkEnd = $true; } 269 | 270 | if ( $prevTag eq "E" and $tag eq "E" ) { $chunkEnd = $true; } 271 | if ( $prevTag eq "E" and $tag eq "I" ) { $chunkEnd = $true; } 272 | if ( $prevTag eq "E" and $tag eq "O" ) { $chunkEnd = $true; } 273 | if ( $prevTag eq "E" and $tag eq "S" ) { $chunkEnd = $true; } 274 | if ( $prevTag eq "E" and $tag eq "B" ) { $chunkEnd = $true; } 275 | 276 | if ( $prevTag eq "S" and $tag eq "E" ) { $chunkEnd = $true; } 277 | if ( $prevTag eq "S" and $tag eq "I" ) { $chunkEnd = $true; } 278 | if ( $prevTag eq "S" and $tag eq "O" ) { $chunkEnd = $true; } 279 | if ( $prevTag eq "S" and $tag eq "S" ) { $chunkEnd = $true; } 280 | if ( $prevTag eq "S" and $tag eq "B" ) { $chunkEnd = $true; } 281 | 282 | 283 | if ($prevTag ne "O" and $prevTag ne "." 
and $prevType ne $type) { 284 | $chunkEnd = $true; 285 | } 286 | 287 | # corrected 1998-12-22: these chunks are assumed to have length 1 288 | if ( $prevTag eq "]" ) { $chunkEnd = $true; } 289 | if ( $prevTag eq "[" ) { $chunkEnd = $true; } 290 | 291 | return($chunkEnd); 292 | } 293 | 294 | # startOfChunk: checks if a chunk started between the previous and current word 295 | # arguments: previous and current chunk tags, previous and current types 296 | # note: this code is capable of handling other chunk representations 297 | # than the default CoNLL-2000 ones, see EACL'99 paper of Tjong 298 | # Kim Sang and Veenstra http://xxx.lanl.gov/abs/cs.CL/9907006 299 | 300 | sub startOfChunk { 301 | my $prevTag = shift(@_); 302 | my $tag = shift(@_); 303 | my $prevType = shift(@_); 304 | my $type = shift(@_); 305 | my $chunkStart = $false; 306 | 307 | if ( $prevTag eq "B" and $tag eq "B" ) { $chunkStart = $true; } 308 | if ( $prevTag eq "I" and $tag eq "B" ) { $chunkStart = $true; } 309 | if ( $prevTag eq "O" and $tag eq "B" ) { $chunkStart = $true; } 310 | if ( $prevTag eq "S" and $tag eq "B" ) { $chunkStart = $true; } 311 | if ( $prevTag eq "E" and $tag eq "B" ) { $chunkStart = $true; } 312 | 313 | if ( $prevTag eq "B" and $tag eq "S" ) { $chunkStart = $true; } 314 | if ( $prevTag eq "I" and $tag eq "S" ) { $chunkStart = $true; } 315 | if ( $prevTag eq "O" and $tag eq "S" ) { $chunkStart = $true; } 316 | if ( $prevTag eq "S" and $tag eq "S" ) { $chunkStart = $true; } 317 | if ( $prevTag eq "E" and $tag eq "S" ) { $chunkStart = $true; } 318 | 319 | if ( $prevTag eq "O" and $tag eq "I" ) { $chunkStart = $true; } 320 | if ( $prevTag eq "S" and $tag eq "I" ) { $chunkStart = $true; } 321 | if ( $prevTag eq "E" and $tag eq "I" ) { $chunkStart = $true; } 322 | 323 | if ( $prevTag eq "S" and $tag eq "E" ) { $chunkStart = $true; } 324 | if ( $prevTag eq "E" and $tag eq "E" ) { $chunkStart = $true; } 325 | if ( $prevTag eq "O" and $tag eq "E" ) { $chunkStart = $true; } 326 | 327 | if ($tag ne "O" and $tag ne "." 
and $prevType ne $type) { 328 | $chunkStart = $true; 329 | } 330 | 331 | # corrected 1998-12-22: these chunks are assumed to have length 1 332 | if ( $tag eq "[" ) { $chunkStart = $true; } 333 | if ( $tag eq "]" ) { $chunkStart = $true; } 334 | 335 | return($chunkStart); 336 | } 337 | -------------------------------------------------------------------------------- /Task4-Named Entity Recognization/crf.py: -------------------------------------------------------------------------------- 1 | import torch 2 | from torch import nn 3 | 4 | 5 | class CRF(nn.Module): 6 | def __init__(self, label_size): 7 | '''label_size = real size + 2, included START and END ''' 8 | super(CRF, self).__init__() 9 | 10 | self.label_size = label_size 11 | self.start = self.label_size - 2 12 | self.end = self.label_size - 1 13 | transition = torch.randn(self.label_size, self.label_size) 14 | self.transition = nn.Parameter(transition) 15 | self.initialize() 16 | 17 | def initialize(self): 18 | nn.init.uniform_(self.transition.data, -0.1, 0.1) 19 | self.transition.data[:, self.end] = -1000.0 20 | self.transition.data[self.start, :] = -1000.0 21 | 22 | def pad_logits(self, logits): 23 | batch_size, seq_len, label_num = logits.size() 24 | pads = logits.new_full((batch_size, seq_len, 2), -1000.0, 25 | requires_grad=False) 26 | logits = torch.cat([logits, pads], dim=2) 27 | return logits 28 | 29 | def calc_binary_score(self, labels, predict_mask): 30 | ''' 31 | Gold transition score 32 | :param labels: [batch_size, seq_len] LongTensor 33 | :param predict_mask: [batch_size, seq_len] LongTensor 34 | :return: [batch_size] FloatTensor 35 | ''' 36 | batch_size, seq_len = labels.size() 37 | 38 | labels_ext = labels.new_empty((batch_size, seq_len + 2)) 39 | labels_ext[:, 0] = self.start 40 | labels_ext[:, 1:-1] = labels 41 | labels_ext[:, -1] = self.end 42 | pad = predict_mask.new_ones([batch_size, 1], requires_grad=False) 43 | pad_stop = labels.new_full([batch_size, 1], self.end, requires_grad=False) 44 | mask = torch.cat([pad, predict_mask, pad], dim=-1).long() 45 | labels = (1 - mask) * pad_stop + mask * labels_ext 46 | 47 | trn = self.transition 48 | trn_exp = trn.unsqueeze(0).expand(batch_size, *trn.size()) 49 | lbl_r = labels[:, 1:] 50 | lbl_rexp = lbl_r.unsqueeze(-1).expand(*lbl_r.size(), trn.size(0)) 51 | trn_row = torch.gather(trn_exp, 1, lbl_rexp) 52 | 53 | lbl_lexp = labels[:, :-1].unsqueeze(-1) 54 | trn_scr = torch.gather(trn_row, 2, lbl_lexp) 55 | trn_scr = trn_scr.squeeze(-1) 56 | 57 | mask = torch.cat([pad, predict_mask], dim=-1).float() 58 | trn_scr = trn_scr * mask 59 | score = trn_scr 60 | 61 | return score 62 | 63 | def calc_unary_score(self, logits, labels, predict_mask): 64 | ''' 65 | Gold logits score 66 | :param logits: [batch_size, seq_len, n_labels] FloatTensor 67 | :param labels: [batch_size, seq_len] LongTensor 68 | :param predict_mask: [batch_size, seq_len] LongTensor 69 | :return: [batch_size] FloatTensor 70 | ''' 71 | labels_exp = labels.unsqueeze(-1) 72 | scores = torch.gather(logits, 2, labels_exp).squeeze(-1) 73 | scores = scores * predict_mask.float() 74 | return scores 75 | 76 | def calc_gold_score(self, logits, labels, predict_mask): 77 | ''' 78 | Total score of gold sequence. 
79 | :param logits: [batch_size, seq_len, n_labels] FloatTensor 80 | :param labels: [batch_size, seq_len] LongTensor 81 | :param predict_mask: [batch_size, seq_len] LongTensor 82 | :return: [batch_size] FloatTensor 83 | ''' 84 | unary_score = self.calc_unary_score(logits, labels, predict_mask).sum( 85 | 1).squeeze(-1) 86 | # print(unary_score) 87 | binary_score = self.calc_binary_score(labels, predict_mask).sum(1).squeeze(-1) 88 | # print(binary_score) 89 | return unary_score + binary_score 90 | 91 | def calc_norm_score(self, logits, predict_mask): 92 | ''' 93 | Total score of all sequences. 94 | :param logits: [batch_size, seq_len, n_labels] FloatTensor 95 | :param predict_mask: [batch_size, seq_len] LongTensor 96 | :return: [batch_size] FloatTensor 97 | ''' 98 | batch_size, seq_len, feat_dim = logits.size() 99 | 100 | alpha = logits.new_full((batch_size, self.label_size), -100.0) 101 | alpha[:, self.start] = 0 102 | 103 | predict_mask_ = predict_mask.clone() # (batch_size, max_seq) 104 | 105 | logits_t = logits.transpose(1, 0) # (max_seq, batch_size, num_labels + 2) 106 | predict_mask_ = predict_mask_.transpose(1, 0) # (max_seq, batch_size) 107 | for word_mask_, logit in zip(predict_mask_, logits_t): 108 | logit_exp = logit.unsqueeze(-1).expand(batch_size, 109 | *self.transition.size()) 110 | alpha_exp = alpha.unsqueeze(1).expand(batch_size, 111 | *self.transition.size()) 112 | trans_exp = self.transition.unsqueeze(0).expand_as(alpha_exp) 113 | mat = logit_exp + alpha_exp + trans_exp 114 | alpha_nxt = log_sum_exp(mat, 2).squeeze(-1) 115 | 116 | mask = word_mask_.float().unsqueeze(-1).expand_as(alpha) # (batch_size, num_labels+2) 117 | alpha = mask * alpha_nxt + (1 - mask) * alpha 118 | 119 | alpha = alpha + self.transition[self.end].unsqueeze(0).expand_as(alpha) 120 | norm = log_sum_exp(alpha, 1).squeeze(-1) 121 | 122 | return norm 123 | 124 | def viterbi_decode(self, logits, predict_mask): 125 | """ 126 | :param logits: [batch_size, seq_len, n_labels] FloatTensor 127 | :param predict_mask: [batch_size, seq_len] LongTensor 128 | :return scores: [batch_size] FloatTensor 129 | :return paths: [batch_size, seq_len] LongTensor 130 | """ 131 | batch_size, seq_len, n_labels = logits.size() 132 | vit = logits.new_full((batch_size, self.label_size), -100.0) 133 | vit[:, self.start] = 0 134 | predict_mask_ = predict_mask.clone() # (batch_size, max_seq) 135 | predict_mask_ = predict_mask_.transpose(1, 0) # (max_seq, batch_size) 136 | logits_t = logits.transpose(1, 0) 137 | pointers = [] 138 | for ix, logit in enumerate(logits_t): 139 | vit_exp = vit.unsqueeze(1).expand(batch_size, n_labels, n_labels) 140 | trn_exp = self.transition.unsqueeze(0).expand_as(vit_exp) 141 | vit_trn_sum = vit_exp + trn_exp 142 | vt_max, vt_argmax = vit_trn_sum.max(2) 143 | 144 | vt_max = vt_max.squeeze(-1) 145 | vit_nxt = vt_max + logit 146 | pointers.append(vt_argmax.squeeze(-1).unsqueeze(0)) 147 | 148 | mask = predict_mask_[ix].float().unsqueeze(-1).expand_as(vit_nxt) 149 | vit = mask * vit_nxt + (1 - mask) * vit 150 | 151 | mask = (predict_mask_[ix:].sum(0) == 1).float().unsqueeze(-1).expand_as(vit_nxt) 152 | vit += mask * self.transition[self.end].unsqueeze( 153 | 0).expand_as(vit_nxt) 154 | 155 | pointers = torch.cat(pointers) 156 | scores, idx = vit.max(1) 157 | paths = [idx.unsqueeze(1)] 158 | for argmax in reversed(pointers): 159 | idx_exp = idx.unsqueeze(-1) 160 | idx = torch.gather(argmax, 1, idx_exp) 161 | idx = idx.squeeze(-1) 162 | 163 | paths.insert(0, idx.unsqueeze(1)) 164 | 165 | paths = 
torch.cat(paths[1:], 1) 166 | scores = scores.squeeze(-1) 167 | 168 | return scores, paths 169 | 170 | 171 | def log_sum_exp(tensor, dim=0): 172 | """LogSumExp operation.""" 173 | m, _ = torch.max(tensor, dim) 174 | m_exp = m.unsqueeze(-1).expand_as(tensor) 175 | return m + torch.log(torch.sum(torch.exp(tensor - m_exp), dim)) 176 | 177 | 178 | def test(): 179 | torch.manual_seed(2) 180 | logits = torch.tensor([[[1.2, 2.1], [2.8, 2.1], [2.2, -2.1]], [[4.1, 2.2], [2.8, 2.1], [2.2, -2.1]]]) # 2, 3, 2 181 | predict_mask = torch.tensor([[1, 1, 0], [1, 0, 0]]) # 2, 3 182 | labels = torch.tensor([[1, 0, 0], [0, 1, 1]]) # 2, 3 183 | 184 | crf = CRF(4) 185 | logits = crf.pad_logits(logits) 186 | norm_score = crf.calc_norm_score(logits, predict_mask) 187 | print(norm_score) 188 | gold_score = crf.calc_gold_score(logits, labels, predict_mask) 189 | print(gold_score) 190 | loglik = gold_score - norm_score 191 | print(loglik) 192 | print(crf.viterbi_decode(logits, predict_mask)) 193 | 194 | # test() 195 | -------------------------------------------------------------------------------- /Task4-Named Entity Recognization/models.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | from crf import CRF 4 | import torch.nn.functional as F 5 | import math 6 | 7 | class CharCNN(nn.Module): 8 | def __init__(self, num_filters, kernel_sizes, padding): 9 | super(CharCNN, self).__init__() 10 | self.conv = nn.Conv2d(1, num_filters, kernel_sizes, padding=padding) 11 | 12 | def forward(self, x): 13 | ''' 14 | :param x: (batch * seq_len, 1, max_word_len, char_embed_size) 15 | :return: (batch * seq_len, num_filters) 16 | ''' 17 | x = self.conv(x).squeeze(-1) # (batch * seq_len, num_filters, max_word_len) 18 | x_max = F.max_pool1d(x, x.size(2)).squeeze(-1) # (batch * seq_len, num_filters) 19 | return x_max 20 | 21 | class BiLSTM(nn.Module): 22 | def __init__(self, input_size, hidden_size=128, dropout_rate=0.1, layer_num=1): 23 | super(BiLSTM, self).__init__() 24 | self.hidden_size = hidden_size 25 | self.layer_num = layer_num 26 | if layer_num == 1: 27 | self.bilstm = nn.LSTM(input_size, hidden_size // 2, layer_num, batch_first=True, bidirectional=True) 28 | 29 | else: 30 | self.bilstm = nn.LSTM(input_size, hidden_size // 2, layer_num, batch_first=True, dropout=dropout_rate, 31 | bidirectional=True) 32 | self.init_weights() 33 | 34 | def init_weights(self): 35 | for name, p in self.bilstm._parameters.items(): 36 | if p.dim() > 1: 37 | bias = math.sqrt(6 / (p.size(0) / 4 + p.size(1))) 38 | nn.init.uniform_(p, -bias, bias) 39 | else: 40 | p.data.zero_() 41 | # This is the range of indices for our forget gates for each LSTM cell 42 | p.data[self.hidden_size // 2: self.hidden_size] = 1 43 | 44 | def forward(self, x, lens): 45 | ''' 46 | :param x: (batch, seq_len, input_size) 47 | :param lens: (batch, ) 48 | :return: (batch, seq_len, hidden_size) 49 | ''' 50 | ordered_lens, index = lens.sort(descending=True) 51 | ordered_x = x[index] 52 | packed_x = nn.utils.rnn.pack_padded_sequence(ordered_x, ordered_lens, batch_first=True) 53 | packed_output, _ = self.bilstm(packed_x) 54 | output, _ = nn.utils.rnn.pad_packed_sequence(packed_output, batch_first=True) 55 | recover_index = index.argsort() 56 | output = output[recover_index] 57 | return output 58 | 59 | 60 | class SoftmaxDecoder(nn.Module): 61 | def __init__(self, label_size, input_dim): 62 | super(SoftmaxDecoder, self).__init__() 63 | self.input_dim = input_dim 64 | self.label_size = label_size 65 | 
self.linear = torch.nn.Linear(input_dim, label_size) 66 | self.init_weights() 67 | 68 | def init_weights(self): 69 | bias = math.sqrt(6 / (self.linear.weight.size(0) + self.linear.weight.size(1))) 70 | nn.init.uniform_(self.linear.weight, -bias, bias) 71 | 72 | def forward_model(self, inputs): 73 | batch_size, seq_len, input_dim = inputs.size() 74 | output = inputs.contiguous().view(-1, self.input_dim) 75 | output = self.linear(output) 76 | output = output.view(batch_size, seq_len, self.label_size) 77 | return output 78 | 79 | def forward(self, inputs, lens, label_ids=None): 80 | logits = self.forward_model(inputs) 81 | p = torch.nn.functional.softmax(logits, -1) # (batch_size, max_seq_len, num_labels) 82 | predict_mask = (torch.arange(inputs.size(1)).expand(len(lens), inputs.size(1))).to(lens.device) < lens.unsqueeze(1) 83 | if label_ids is not None: 84 | # cross entropy loss 85 | p = torch.nn.functional.softmax(logits, -1) # (batch_size, max_seq_len, num_labels) 86 | one_hot_labels = torch.eye(self.label_size)[label_ids].type_as(p) 87 | losses = -torch.log(torch.sum(one_hot_labels * p, -1)) # (batch_size, max_seq_len) 88 | masked_losses = torch.masked_select(losses, predict_mask) # (batch_sum_real_len) 89 | return masked_losses.sum() 90 | else: 91 | return torch.argmax(logits, -1), p 92 | 93 | class CRFDecoder(nn.Module): 94 | def __init__(self, label_size, input_dim): 95 | super(CRFDecoder, self).__init__() 96 | self.input_dim = input_dim 97 | self.linear = nn.Linear(in_features=input_dim, 98 | out_features=label_size) 99 | self.crf = CRF(label_size + 2) 100 | self.label_size = label_size 101 | 102 | self.init_weights() 103 | 104 | def init_weights(self): 105 | bias = math.sqrt(6 / (self.linear.weight.size(0) + self.linear.weight.size(1))) 106 | nn.init.uniform_(self.linear.weight, -bias, bias) 107 | 108 | def forward_model(self, inputs): 109 | batch_size, seq_len, input_dim = inputs.size() 110 | output = inputs.contiguous().view(-1, self.input_dim) 111 | output = self.linear(output) 112 | output = output.view(batch_size, seq_len, self.label_size) 113 | return output 114 | 115 | def forward(self, inputs, lens, labels=None): 116 | ''' 117 | :param inputs:(batch_size, max_seq_len, input_dim) 118 | :param predict_mask:(batch_size, max_seq_len) 119 | :param labels:(batch_size, max_seq_len) 120 | :return: if labels is None, return preds(batch_size, max_seq_len) and p(batch_size, max_seq_len, num_labels); 121 | else return loss (scalar). 
122 | '''
123 | logits = self.forward_model(inputs) # (batch_size, max_seq_len, num_labels)
124 | p = torch.nn.functional.softmax(logits, -1) # (batch_size, max_seq_len, num_labels)
125 | logits = self.crf.pad_logits(logits)
126 | predict_mask = (torch.arange(inputs.size(1)).expand(len(lens), inputs.size(1))).to(lens.device) < lens.unsqueeze(1)
127 | if labels is None:
128 | _, preds = self.crf.viterbi_decode(logits, predict_mask)
129 | return preds, p
130 | return self.neg_log_likelihood(logits, predict_mask, labels)
131 | 
132 | def neg_log_likelihood(self, logits, predict_mask, labels):
133 | norm_score = self.crf.calc_norm_score(logits, predict_mask)
134 | gold_score = self.crf.calc_gold_score(logits, labels, predict_mask)
135 | loglik = gold_score - norm_score
136 | return -loglik.sum()
137 | 
138 | 
139 | class NER_Model(nn.Module):
140 | def __init__(self, word_embed, char_embed,
141 | num_labels, hidden_size, dropout_rate=(0.33, 0.5, (0.33, 0.5)),
142 | lstm_layer_num=1, kernel_step=3, char_out_size=100, use_char=False,
143 | freeze=False, use_crf=True):
144 | super(NER_Model, self).__init__()
145 | self.word_embed = nn.Embedding.from_pretrained(word_embed, freeze)
146 | self.word_embed_size = word_embed.size(-1)
147 | self.use_char = use_char
148 | if use_char:
149 | self.char_embed = nn.Embedding.from_pretrained(char_embed, freeze)
150 | self.char_embed_size = char_embed.size(-1)
151 | self.charcnn = CharCNN(char_out_size, (kernel_step, self.char_embed_size), (2, 0))
152 | self.bilstm = BiLSTM(char_out_size + self.word_embed_size, hidden_size, dropout_rate[2][1], lstm_layer_num)
153 | else:
154 | self.bilstm = BiLSTM(self.word_embed_size, hidden_size, dropout_rate[2][1], lstm_layer_num)
155 | 
156 | self.embed_dropout = nn.Dropout(dropout_rate[0])
157 | self.out_dropout = nn.Dropout(dropout_rate[1])
158 | self.rnn_in_dropout = nn.Dropout(dropout_rate[2][0])
159 | 
160 | if use_crf:
161 | self.decoder = CRFDecoder(num_labels, hidden_size)
162 | else:
163 | self.decoder = SoftmaxDecoder(num_labels, hidden_size)
164 | 
165 | 
166 | def forward(self, word_ids, char_ids, lens, label_ids=None):
167 | '''
168 | 
169 | :param word_ids: (batch_size, max_seq_len)
170 | :param char_ids: (batch_size, max_seq_len, max_word_len)
171 | :param lens: (batch_size, )
172 | :param label_ids: (batch_size, max_seq_len)
173 | :return: if label_ids is None, return preds(batch_size, max_seq_len) and p(batch_size, max_seq_len, num_labels);
174 | else return loss (scalar).
175 | '''
176 | word_embed = self.word_embed(word_ids)
177 | if self.use_char:
178 | # reshape char_embed and apply to CNN
179 | char_embed = self.char_embed(char_ids).reshape(-1, char_ids.size(-1), self.char_embed_size).unsqueeze(1)
180 | char_embed = self.embed_dropout(
181 | char_embed) # a dropout layer applied before character embeddings are input to CNN.
182 | char_embed = self.charcnn(char_embed) 183 | char_embed = char_embed.reshape(char_ids.size(0), char_ids.size(1), -1) 184 | embed = torch.cat([word_embed, char_embed], -1) 185 | else: 186 | embed = word_embed 187 | x = self.rnn_in_dropout(embed) 188 | hidden = self.bilstm(x, lens) # (batch_size, max_seq_len, hidden_size) 189 | hidden = self.out_dropout(hidden) 190 | return self.decoder(hidden, lens, label_ids) 191 | -------------------------------------------------------------------------------- /Task4-Named Entity Recognization/run.py: -------------------------------------------------------------------------------- 1 | # -*- coding:utf8 -*- 2 | import torch 3 | import torch.optim as optim 4 | from tqdm import tqdm 5 | from torchtext.vocab import Vectors 6 | from models import NER_Model 7 | import codecs 8 | from util import load_iters, get_chunks 9 | 10 | torch.manual_seed(1) 11 | DEVICE = torch.device('cuda' if torch.cuda.is_available() else 'cpu') 12 | 13 | DATA_PATH = "data" 14 | PREDICT_OUT_FILE = "res2.txt" 15 | BEST_MODEL = "best_model2.ckpt" 16 | BATCH_SIZE = 10 17 | LOWER_CASE = False 18 | EPOCHS = 200 19 | 20 | # embedding 21 | WORD_VECTORS = None 22 | # WORD_VECTORS = Vectors('glove.6B.100d.txt', '../../embeddings/glove.6B') 23 | WORD_EMBEDDING_SIZE = 100 24 | CHAR_VECTORS = None 25 | CHAR_EMBEDDING_SIZE = 30 # the input char embedding to CNN 26 | FREEZE_EMBEDDING = False 27 | 28 | # SGD parameters 29 | LEARNING_RATE = 0.015 30 | DECAY_RATE = 0.05 31 | MOMENTUM = 0.9 32 | CLIP = 5 33 | PATIENCE = 5 34 | 35 | # network parameters 36 | HIDDEN_SIZE = 400 # every LSTM's(forward and backward) hidden size is half of HIDDEN_SIZE 37 | LSTM_LAYER_NUM = 1 38 | DROPOUT_RATE = (0.5, 0.5, (0.5, 0.5)) # after embed layer, other case, (input to rnn, between rnn layers) 39 | USE_CHAR = True # use char level information 40 | N_FILTERS = 30 # the output char embedding from CNN 41 | KERNEL_STEP = 3 # n-gram size of CNN 42 | USE_CRF = True 43 | 44 | 45 | def train(train_iter, dev_iter, optimizer): 46 | best_dev_f1 = -1 47 | patience_counter = 0 48 | for epoch in range(1, EPOCHS + 1): 49 | model.train() 50 | total_loss = 0 51 | train_iter.init_epoch() 52 | for i, batch in enumerate(tqdm(train_iter)): 53 | words, lens = batch.word 54 | labels = batch.label 55 | if i < 2: 56 | tqdm.write(' '.join([WORD.vocab.itos[i] for i in words[0]])) 57 | tqdm.write(' '.join([LABEL.vocab.itos[i] for i in labels[0]])) 58 | model.zero_grad() 59 | loss = model(words, batch.char, lens, labels) 60 | total_loss += loss.item() 61 | loss.backward() 62 | torch.nn.utils.clip_grad_norm_(model.parameters(), CLIP) 63 | optimizer.step() 64 | tqdm.write("Epoch: %d, Train Loss: %d" % (epoch, total_loss)) 65 | 66 | lr = LEARNING_RATE / (1 + DECAY_RATE * epoch) 67 | for param_group in optimizer.param_groups: 68 | param_group['lr'] = lr 69 | 70 | dev_f1 = eval(dev_iter, "Dev", epoch) 71 | if dev_f1 < best_dev_f1: 72 | patience_counter += 1 73 | tqdm.write("No improvement, patience: %d/%d" % (patience_counter, PATIENCE)) 74 | else: 75 | best_dev_f1 = dev_f1 76 | patience_counter = 0 77 | torch.save(model.state_dict(), BEST_MODEL) 78 | tqdm.write("New best model, saved to best_model.ckpt, patience: 0/%d" % PATIENCE) 79 | if patience_counter >= PATIENCE: 80 | tqdm.write("Early stopping: patience limit reached, stopping...") 81 | break 82 | 83 | 84 | def eval(data_iter, name, epoch=None, best_model=None): 85 | if best_model: 86 | model.load_state_dict(torch.load(best_model)) 87 | model.eval() 88 | with torch.no_grad(): 89 | 
total_loss = 0 90 | res = {'ootv': [0, 0, 0], 'ooev': [0, 0, 0], 'oobv': [0, 0, 0], 'iv': [0, 0, 0], 91 | 'total': [0, 0, 0]} # e.g. 'iv':[correct_preds, total_preds, total_correct] 92 | for i, batch in enumerate(data_iter): 93 | words, lens = batch.word 94 | labels = batch.label 95 | predicted_seq, _ = model(words, batch.char, lens) # predicted_seq : (batch_size, seq_len) 96 | loss = model(words, batch.char, lens, labels) 97 | total_loss += loss.item() 98 | 99 | orig_text = [e.word for e in data_iter.dataset.examples[i * BATCH_SIZE:(i + 1) * BATCH_SIZE]] 100 | for text, ground_truth_id, predicted_id, len_ in zip(orig_text, labels.cpu().numpy(), 101 | predicted_seq.cpu().numpy(), 102 | lens.cpu().numpy()): 103 | lab_chunks = set(get_chunks(ground_truth_id[:len_], LABEL.vocab.stoi)) 104 | lab_pred_chunks = set(get_chunks(predicted_id[:len_], LABEL.vocab.stoi)) 105 | 106 | for chunk in list(lab_chunks): 107 | entity_word = ' '.join([text[ix] for ix in range(chunk[1], chunk[2])]) 108 | # type, 1:ootv, 2:ooev, 3:oobv, 4:iv 109 | entity_type = WORD.vocab.dev_entity2type[entity_word] if name == "Dev" else \ 110 | WORD.vocab.test_entity2type[entity_word] 111 | if entity_type == 1: 112 | if chunk in lab_pred_chunks: 113 | res['ootv'][0] += 1 114 | res['ootv'][2] += 1 115 | elif entity_type == 2: 116 | if chunk in lab_pred_chunks: 117 | res['ooev'][0] += 1 118 | res['ooev'][2] += 1 119 | elif entity_type == 3: 120 | if chunk in lab_pred_chunks: 121 | res['oobv'][0] += 1 122 | res['oobv'][2] += 1 123 | else: 124 | if chunk in lab_pred_chunks: 125 | res['iv'][0] += 1 126 | res['iv'][2] += 1 127 | if chunk in lab_pred_chunks: 128 | res['total'][0] += 1 129 | res['total'][2] += 1 130 | for chunk in list(lab_pred_chunks): 131 | entity_word = ' '.join([text[ix] for ix in range(chunk[1], chunk[2])]) 132 | # type, 1:ootv, 2:ooev, 3:oobv, 4:iv 133 | entity_type = WORD.vocab.dev_entity2type.get(entity_word, None) if name == "Dev" else \ 134 | WORD.vocab.test_entity2type.get(entity_word, None) 135 | if entity_type == 1: 136 | res['ootv'][1] += 1 137 | elif entity_type == 2: 138 | res['ooev'][1] += 1 139 | elif entity_type == 3: 140 | res['oobv'][1] += 1 141 | elif entity_type == 4: 142 | res['iv'][1] += 1 143 | res['total'][1] += 1 144 | 145 | # Calculating the F1-Score 146 | for k, v in res.items(): 147 | p = v[0] / v[1] if v[1] != 0 else 0 148 | r = v[0] / v[2] if v[2] != 0 else 0 149 | micro_F1 = 2 * p * r / (p + r) if (p + r) != 0 else 0 150 | if epoch is not None: 151 | tqdm.write( 152 | "Epoch: %d, %s, %s Entity Micro F1: %.3f, Loss %.3f" % (epoch, name, k, micro_F1, total_loss)) 153 | else: 154 | tqdm.write( 155 | "%s, %s Entity Micro F1: %.3f, Loss %.3f" % (name, k, micro_F1, total_loss)) 156 | return micro_F1 157 | 158 | 159 | def predict(data_iter, out_file): 160 | model.eval() 161 | with torch.no_grad(): 162 | gold_seqs = [] 163 | predicted_seqs = [] 164 | word_seqs = [] 165 | for i, batch in enumerate(data_iter): 166 | words, lens = batch.word 167 | predicted_seq, _ = model(words, batch.char, lens) # predicted_seq : (batch_size, seq_len) 168 | gold_seqs.extend(batch.label.tolist()) 169 | predicted_seqs.extend(predicted_seq.tolist()) 170 | word_seqs.extend(words.tolist()) 171 | write_predicted_labels(out_file, data_iter.dataset.examples, word_seqs, LABEL.vocab.itos, gold_seqs, 172 | predicted_seqs) 173 | 174 | 175 | def write_predicted_labels(output_file, orig_text, word_ids, id2label, gold_seq, predicted_seq): 176 | with codecs.open(output_file, 'w', encoding='utf-8') as writer: 177 | for text, 
wids, predict, gold in zip(orig_text, word_ids, predicted_seq, gold_seq): 178 | ix = 0 179 | for w_id, p_id, g_id in zip(wids, predict, gold): 180 | if w_id == pad_idx: break 181 | output_line = ' '.join([text.word[ix], id2label[g_id], id2label[p_id]]) 182 | writer.write(output_line + '\n') 183 | ix += 1 184 | writer.write('\n') 185 | 186 | 187 | if __name__ == "__main__": 188 | train_iter, dev_iter, test_iter, WORD, CHAR, LABEL = load_iters(WORD_EMBEDDING_SIZE, WORD_VECTORS, 189 | CHAR_EMBEDDING_SIZE, CHAR_VECTORS, 190 | BATCH_SIZE, DEVICE, DATA_PATH, LOWER_CASE) 191 | 192 | model = NER_Model(WORD.vocab.vectors, CHAR.vocab.vectors, len(LABEL.vocab.stoi), HIDDEN_SIZE, DROPOUT_RATE, 193 | LSTM_LAYER_NUM, 194 | KERNEL_STEP, N_FILTERS, USE_CHAR, FREEZE_EMBEDDING, USE_CRF).to(DEVICE) 195 | print(model) 196 | pad_idx = WORD.vocab.stoi[WORD.pad_token] 197 | 198 | optimizer = optim.SGD(model.parameters(), lr=LEARNING_RATE, momentum=MOMENTUM) 199 | # train(train_iter, dev_iter, optimizer) 200 | eval(test_iter, "Test", best_model=BEST_MODEL) 201 | predict(test_iter, PREDICT_OUT_FILE) 202 | -------------------------------------------------------------------------------- /Task4-Named Entity Recognization/util.py: -------------------------------------------------------------------------------- 1 | from torchtext import data 2 | from torchtext.data import Iterator, BucketIterator 3 | import os 4 | import re 5 | import math 6 | import torch 7 | import numpy as np 8 | 9 | 10 | def read_data(input_file): 11 | """Reads a BIO data.""" 12 | with open(input_file) as f: 13 | lines = [] 14 | words = [] 15 | labels = [] 16 | for line in f: 17 | contends = line.strip() 18 | # if contends.startswith("-DOCSTART-"): 19 | # continue 20 | if len(contends) == 0: 21 | if len(words) == 0: 22 | continue 23 | lines.append([words, [list(word) for word in words], labels]) 24 | words = [] 25 | labels = [] 26 | continue 27 | tokens = line.strip().split(' ') 28 | assert (len(tokens) == 4) 29 | word = tokens[0] 30 | label = tokens[-1] 31 | words.append(word) 32 | labels.append(label) 33 | return lines 34 | 35 | 36 | class ConllDataset(data.Dataset): 37 | 38 | def __init__(self, word_field, char_field, label_field, datafile, **kwargs): 39 | fields = [("word", word_field), ("char", char_field), ("label", label_field)] 40 | datas = read_data(datafile) 41 | examples = [] 42 | for word, char, label in datas: 43 | examples.append(data.Example.fromlist([word, char, label], fields)) 44 | super(ConllDataset, self).__init__(examples, fields, **kwargs) 45 | 46 | 47 | def unk_init(x): 48 | dim = x.size(-1) 49 | bias = math.sqrt(3.0 / dim) 50 | x.uniform_(-bias, bias) 51 | return x 52 | 53 | 54 | def get_char_detail(train, other, embed_vocab=None): 55 | char2type = {} # type, 1:ootv, 2:ooev, 3:oobv, 4:iv 56 | ootv = 0 57 | ootv_set = set() 58 | ooev = 0 59 | oobv = 0 60 | iv = 0 61 | fuzzy_iv = 0 62 | for sent in other: 63 | for w in sent: 64 | for c in w: 65 | if c not in char2type: 66 | if c not in train.stoi: 67 | if embed_vocab and (c in embed_vocab.stoi or c.lower() in embed_vocab.stoi): 68 | ootv += 1 69 | ootv_set.add(c) 70 | char2type[c] = 1 71 | else: 72 | oobv += 1 73 | char2type[c] = 3 74 | else: 75 | if embed_vocab and (c in embed_vocab.stoi or c.lower() in embed_vocab.stoi): 76 | fuzzy_iv += 1 if c.lower() in embed_vocab.stoi else 0 77 | iv += 1 78 | char2type[c] = 4 79 | else: 80 | ooev += 1 81 | char2type[c] = 2 82 | print("IV {}(fuzzy {})\nOOTV {}\nOOEV {}\nOOBV {}\n".format(iv, fuzzy_iv, ootv, ooev, oobv)) 83 | return 
char2type, ootv_set 84 | 85 | 86 | def get_word_detail(train, other, embed_vocab=None): 87 | ''' 88 | OOTV words are the ones do not appear in training set but in embedding vocabulary 89 | OOEV words are the ones do not appear in embedding vocabulary but in training set 90 | OOBV words are the ones do not appears in both the training and embedding vocabulary 91 | IV words the ones appears in both the training and embedding vocabulary 92 | ''' 93 | word2type = {} # type, 1:ootv, 2:ooev, 3:oobv, 4:iv 94 | ootv = 0 95 | ootv_set = set() 96 | ooev = 0 97 | oobv = 0 98 | iv = 0 99 | fuzzy_iv = 0 100 | for sent in other: 101 | for w in sent: 102 | if w not in word2type: 103 | if w not in train.stoi: 104 | if embed_vocab and (w in embed_vocab.stoi or w.lower() in embed_vocab.stoi): 105 | ootv += 1 106 | ootv_set.add(w) 107 | word2type[w] = 1 108 | else: 109 | oobv += 1 110 | word2type[w] = 3 111 | else: 112 | if embed_vocab and (w in embed_vocab.stoi or w.lower() in embed_vocab.stoi): 113 | fuzzy_iv += 1 if w not in embed_vocab.stoi else 0 114 | iv += 1 115 | word2type[w] = 4 116 | else: 117 | ooev += 1 118 | word2type[w] = 2 119 | print("IV {}(fuzzy {})\nOOTV {}\nOOEV {}\nOOBV {}\n".format(iv, fuzzy_iv, ootv, ooev, oobv)) 120 | return word2type, ootv_set 121 | 122 | 123 | def get_entity_detail(vocab, data, tag2id, embed_vocab=None): 124 | entity2type = {} # type, 1:ootv, 2:ooev, 3:oobv, 4:iv 125 | ootv = 0 # every word of the entity have embedding, but at least one word not in training set 126 | oobv = 0 # an entity is considered as OOBV if there exists at least one word not in training set and at least one word not in embedding vocabulary 127 | ooev = 0 # every word of the entity is in training set, but at least one word not have embedding 128 | iv = 0 129 | for ex in data.examples: 130 | ens = get_chunks(ex.label, tag2id, id_format=False) 131 | for e in ens: 132 | if e not in entity2type: 133 | entity_words = [ex.word[ix] for ix in range(e[1], e[2])] 134 | entity_word = ' '.join(entity_words) 135 | not_in_vocab = len(list(filter(lambda w: w not in vocab.stoi, entity_words))) 136 | if embed_vocab: 137 | not_in_embed = len(list( 138 | filter(lambda w: w not in embed_vocab.stoi and w.lower() not in embed_vocab.stoi, 139 | entity_words))) 140 | if not_in_vocab > 0: 141 | if embed_vocab and not_in_embed == 0: 142 | ootv += 1 143 | entity2type[entity_word] = 1 144 | else: 145 | oobv += 1 146 | entity2type[entity_word] = 3 147 | else: 148 | if embed_vocab and not_in_embed == 0: 149 | iv += 1 150 | entity2type[entity_word] = 4 151 | else: 152 | ooev += 1 153 | entity2type[entity_word] = 2 154 | 155 | print("IV {}\nOOTV {}\nOOEV {}\nOOBV {}\n".format(iv, ootv, ooev, oobv)) 156 | return entity2type 157 | 158 | 159 | def extend(vocab, v, sort=False): 160 | words = sorted(v) if sort else v 161 | for w in words: 162 | if w not in vocab.stoi: 163 | vocab.itos.append(w) 164 | vocab.stoi[w] = len(vocab.itos) - 1 165 | 166 | 167 | def get_entities(vocab, data, tag2id): 168 | entities = {} 169 | unk = 0 170 | conflict = 0 171 | for ex in data.examples: 172 | ens = get_chunks(ex.label, tag2id, id_format=False) 173 | for e in ens: 174 | entity_words = [ex.word[ix] if ex.word[ix] in vocab.stoi else vocab.UNK for ix in range(e[1], e[2])] 175 | entities.setdefault(' '.join(entity_words), set()) 176 | entities[' '.join(entity_words)].add(e[0]) 177 | if vocab.UNK in entity_words: 178 | unk += 1 179 | if len(entities[' '.join(entity_words)]) == 2: 180 | conflict += 1 181 | print("entities contains `UNK` %d\nconflict 
entities %d\nall entities: %d\n" % (unk, conflict, len(entities))) 182 | return entities 183 | 184 | 185 | def load_iters(word_embed_size, word_vectors, char_embedding_size, char_vectors, batch_size=32, device="cpu", 186 | data_path='data', word2lower=True): 187 | zero_char_in_word = lambda ex: [re.sub('\d', '0', w) for w in ex] 188 | zero_char = lambda w: [re.sub('\d', '0', c) for c in w] 189 | 190 | WORD_TEXT = data.Field(lower=word2lower, batch_first=True, include_lengths=True, 191 | preprocessing=zero_char_in_word) 192 | CHAR_NESTING = data.Field(tokenize=list, preprocessing=zero_char) # process a word in char list 193 | CHAR_TEXT = data.NestedField(CHAR_NESTING) 194 | LABEL = data.Field(unk_token=None, pad_token="O", batch_first=True) 195 | 196 | train_data = ConllDataset(WORD_TEXT, CHAR_TEXT, LABEL, os.path.join(data_path, "train.txt")) 197 | dev_data = ConllDataset(WORD_TEXT, CHAR_TEXT, LABEL, os.path.join(data_path, "dev.txt")) 198 | test_data = ConllDataset(WORD_TEXT, CHAR_TEXT, LABEL, os.path.join(data_path, "test.txt")) 199 | 200 | print("train sentence num / total word num: %d/%d" % ( 201 | len(train_data.examples), np.array([len(_.word) for _ in train_data.examples]).sum())) 202 | print("dev sentence num / total word num: %d/%d" % ( 203 | len(dev_data.examples), np.array([len(_.word) for _ in dev_data.examples]).sum())) 204 | print("test sentence num / total word num: %d/%d" % ( 205 | len(test_data.examples), np.array([len(_.word) for _ in test_data.examples]).sum())) 206 | 207 | LABEL.build_vocab(train_data.label) 208 | WORD_TEXT.build_vocab(train_data.word, max_size=50000, min_freq=1) 209 | CHAR_TEXT.build_vocab(train_data.char, max_size=50000, min_freq=1) 210 | 211 | # ------------------- word oov analysis----------------------- 212 | print('*' * 50 + ' unique words details of dev set ' + '*' * 50) 213 | dev_word2type, dev_ootv_set = get_word_detail(WORD_TEXT.vocab, dev_data.word, word_vectors) 214 | print('#' * 110) 215 | print('*' * 50 + ' unique words details of test set ' + '*' * 50) 216 | test_word2type, test_ootv_set = get_word_detail(WORD_TEXT.vocab, test_data.word, word_vectors) 217 | print('#' * 110) 218 | WORD_TEXT.vocab.dev_word2type = dev_word2type 219 | WORD_TEXT.vocab.test_word2type = test_word2type 220 | 221 | # ------------------- entity oov analysis----------------------- 222 | print('*' * 50 + ' get train entities ' + '*' * 50) 223 | train_entities = get_entities(WORD_TEXT.vocab, train_data, LABEL.vocab.stoi) 224 | print('#' * 110) 225 | print('*' * 50 + ' get dev entities ' + '*' * 50) 226 | dev_entity2type = get_entity_detail(WORD_TEXT.vocab, dev_data, LABEL.vocab.stoi, word_vectors) 227 | print('#' * 110) 228 | print('*' * 50 + ' get test entities ' + '*' * 50) 229 | test_entity2type = get_entity_detail(WORD_TEXT.vocab, test_data, LABEL.vocab.stoi, word_vectors) 230 | print('#' * 110) 231 | WORD_TEXT.vocab.dev_entity2type = dev_entity2type 232 | WORD_TEXT.vocab.test_entity2type = test_entity2type 233 | 234 | # ------------------- extend word vocab with ootv words ----------------------- 235 | print('*' * 50 + 'extending ootv words to vocab' + '*' * 50) 236 | ootv = list(dev_ootv_set.union(test_ootv_set)) 237 | extend(WORD_TEXT.vocab, ootv) 238 | print('extended %d words' % len(ootv)) 239 | print('#' * 110) 240 | 241 | # ------------------- generate word embedding ----------------------- 242 | vectors_to_use = unk_init(torch.zeros((len(WORD_TEXT.vocab), word_embed_size))) 243 | if word_vectors is not None: 244 | vectors_to_use = 
get_vectors(vectors_to_use, WORD_TEXT.vocab, word_vectors) 245 | WORD_TEXT.vocab.vectors = vectors_to_use 246 | 247 | # ------------------- char oov analysis----------------------- 248 | print('*' * 50 + ' unique chars details of dev set ' + '*' * 50) 249 | dev_char2type, dev_ootv_set = get_char_detail(CHAR_TEXT.vocab, dev_data.char, char_vectors) 250 | print('#' * 110) 251 | print('*' * 50 + ' unique chars details of test set ' + '*' * 50) 252 | test_char2type, test_ootv_set = get_char_detail(CHAR_TEXT.vocab, test_data.char, char_vectors) 253 | print('#' * 110) 254 | CHAR_TEXT.vocab.dev_char2type = dev_char2type 255 | CHAR_TEXT.vocab.test_char2type = test_char2type 256 | 257 | # ------------------- extend char vocab with ootv chars ----------------------- 258 | print('*' * 50 + 'extending ootv chars to vocab' + '*' * 50) 259 | ootv = list(dev_ootv_set.union(test_ootv_set)) 260 | extend(CHAR_TEXT.vocab, ootv) 261 | print('extended %d chars' % len(ootv)) 262 | print('#' * 110) 263 | 264 | # ------------------- generate char embedding ----------------------- 265 | vectors_to_use = unk_init(torch.zeros((len(CHAR_TEXT.vocab), char_embedding_size))) 266 | if char_vectors is not None: 267 | vectors_to_use = get_vectors(vectors_to_use, CHAR_TEXT.vocab, char_vectors) 268 | CHAR_TEXT.vocab.vectors = vectors_to_use 269 | 270 | print("word vocab size: ", len(WORD_TEXT.vocab)) 271 | print("char vocab size: ", len(CHAR_TEXT.vocab)) 272 | print("label vocab size: ", len(LABEL.vocab)) 273 | 274 | train_iter = BucketIterator(train_data, batch_size=batch_size, device=device, sort_key=lambda x: len(x.word), 275 | sort_within_batch=True, repeat=False, shuffle=True) 276 | dev_iter = Iterator(dev_data, batch_size=batch_size, device=device, sort=False, sort_within_batch=False, 277 | repeat=False, shuffle=False) 278 | test_iter = Iterator(test_data, batch_size=batch_size, device=device, sort=False, sort_within_batch=False, 279 | repeat=False, shuffle=False) 280 | return train_iter, dev_iter, test_iter, WORD_TEXT, CHAR_TEXT, LABEL 281 | 282 | 283 | def get_chunk_type(tok, idx_to_tag): 284 | """ 285 | The function takes in a chunk ("B-PER") and then splits it into the tag (PER) and its class (B) 286 | as defined in BIOES 287 | 288 | Args: 289 | tok: id of token, ex 4 290 | idx_to_tag: dictionary {4: "B-PER", ...} 291 | 292 | Returns: 293 | tuple: "B", "PER" 294 | 295 | """ 296 | 297 | tag_name = idx_to_tag[tok] 298 | tag_class = tag_name.split('-')[0] 299 | tag_type = tag_name.split('-')[-1] 300 | return tag_class, tag_type 301 | 302 | 303 | def get_chunks(seq, tags, bioes=True, id_format=True): 304 | """ 305 | Given a sequence of tags, group entities and their position 306 | """ 307 | if not id_format: 308 | seq = [tags[_] for _ in seq] 309 | 310 | # We assume by default the tags lie outside a named entity 311 | default = tags["O"] 312 | 313 | idx_to_tag = {idx: tag for tag, idx in tags.items()} 314 | 315 | chunks = [] 316 | 317 | chunk_class, chunk_type, chunk_start = None, None, None 318 | for i, tok in enumerate(seq): 319 | if tok == default and (chunk_class in (["E", "S"] if bioes else ["B", "I"])): 320 | # Add a chunk. 
321 | chunk = (chunk_type, chunk_start, i) 322 | chunks.append(chunk) 323 | chunk_class, chunk_type, chunk_start = "O", None, None 324 | 325 | if tok != default: 326 | tok_chunk_class, tok_chunk_type = get_chunk_type(tok, idx_to_tag) 327 | if chunk_type is None: 328 | # Initialize chunk for each entity 329 | chunk_class, chunk_type, chunk_start = tok_chunk_class, tok_chunk_type, i 330 | else: 331 | if bioes: 332 | if chunk_class in ["E", "S"]: 333 | chunk = (chunk_type, chunk_start, i) 334 | chunks.append(chunk) 335 | if tok_chunk_class in ["B", "S"]: 336 | chunk_class, chunk_type, chunk_start = tok_chunk_class, tok_chunk_type, i 337 | else: 338 | chunk_class, chunk_type, chunk_start = None, None, None 339 | elif tok_chunk_type == chunk_type and chunk_class in ["B", "I"]: 340 | chunk_class = tok_chunk_class 341 | else: 342 | chunk_class, chunk_type = None, None 343 | else: # BIO schema 344 | if tok_chunk_class == "B": 345 | chunk = (chunk_type, chunk_start, i) 346 | chunks.append(chunk) 347 | chunk_class, chunk_type, chunk_start = tok_chunk_class, tok_chunk_type, i 348 | else: 349 | chunk_class, chunk_type = None, None 350 | 351 | if chunk_type is not None: 352 | chunk = (chunk_type, chunk_start, len(seq)) 353 | chunks.append(chunk) 354 | 355 | return chunks 356 | 357 | 358 | def get_vectors(embed, vocab, pretrain_embed_vocab): 359 | oov = 0 360 | for i, word in enumerate(vocab.itos): 361 | index = pretrain_embed_vocab.stoi.get(word, None) # digit or None 362 | if index is None: 363 | if word.lower() in pretrain_embed_vocab.stoi: 364 | index = pretrain_embed_vocab.stoi[word.lower()] 365 | if index: 366 | embed[i] = pretrain_embed_vocab.vectors[index] 367 | else: 368 | oov += 1 369 | print('train vocab oov %d \ntrain vocab + dev ootv + test ootv: %d' % (oov, len(vocab.stoi))) 370 | return embed 371 | 372 | 373 | def test_get_chunks(): 374 | print(get_chunks([4, 2, 1, 2, 3, 3], 375 | {'O': 0, "B-PER": 1, "I-PER": 2, "E-PER": 3, "S-PER": 4})) 376 | print(get_chunks(["S-PER", "I-PER", "B-PER", "I-PER", "E-PER", "E-PER"], 377 | {'O': 0, "B-PER": 1, "I-PER": 2, "E-PER": 3, "S-PER": 4}, id_format=False)) 378 | -------------------------------------------------------------------------------- /Task5-Language Model/data/poetryFromTang.txt: -------------------------------------------------------------------------------- 1 | 巴山上峡重复重,阳台碧峭十二峰。荆王猎时逢暮雨, 2 | 夜卧高丘梦神女。轻红流烟湿艳姿,行云飞去明星稀。 3 | 目极魂断望不见,猿啼三声泪沾衣。 4 | 5 | 见尽数万里,不闻三声猿。但飞萧萧雨,中有亭亭魂。 6 | 千载楚襄恨,遗文宋玉言。至今青冥里,云结深闺门。 7 | 8 | 碧丛丛,高插天,大江翻澜神曳烟。楚魂寻梦风飔然, 9 | 晓风飞雨生苔钱。瑶姬一去一千年,丁香筇竹啼老猿。 10 | 古祠近月蟾桂寒,椒花坠红湿云间。 11 | 12 | 巫山高,巫女妖,雨为暮兮云为朝,楚王憔悴魂欲销。 13 | 秋猿嗥嗥日将夕,红霞紫烟凝老壁。千岩万壑花皆坼, 14 | 但恐芳菲无正色。不知今古行人行,几人经此无秋情。 15 | 云深庙远不可觅,十二峰头插天碧。 16 | 17 | 君不见黄河之水天上来,奔流到海不复回。 18 | 君不见高堂明镜悲白发,朝如青丝暮成雪。 19 | 人生得意须尽欢,莫使金尊空对月。天生我材必有用, 20 | 千金散尽还复来。烹羊宰牛且为乐,会须一饮三百杯。 21 | 岑夫子,丹丘生,将进酒,杯莫停。与君歌一曲, 22 | 请君为我侧耳听。钟鼓馔玉不足贵,但愿长醉不复醒。 23 | 古来圣贤皆寂寞,惟有饮者留其名。陈王昔时宴平乐, 24 | 斗酒十千恣欢谑。主人何为言少钱,径须酤取对君酌。 25 | 五花马,千金裘,呼儿将出换美酒,与尔同销万古愁。 26 | 27 | 将进酒,将进酒,酒中有毒鸩主父,言之主父伤主母。 28 | 母为妾地父妾天,仰天俯地不忍言。佯为僵踣主父前, 29 | 主父不知加妾鞭。旁人知妾为主说,主将泪洗鞭头血。 30 | 推摧主母牵下堂,扶妾遣升堂上床。将进酒, 31 | 酒中无毒令主寿,愿主回思归主母,遣妾如此事主父。 32 | 妾为此事人偶知,自惭不密方自悲。主今颠倒安置妾, 33 | 贪天僭地谁不为。 34 | 35 | 琉璃钟,琥珀浓,小槽酒滴真珠红。烹龙炮凤玉脂泣, 36 | 罗屏绣幕围香风。吹龙笛,击鼍鼓,皓齿歌,细腰舞。 37 | 况是青春日将暮,桃花乱落如红雨。劝君终日酩酊醉, 38 | 酒不到刘伶坟上土。 39 | 40 | 君马黄,我马白,马色虽不同,人心本无隔。 41 | 共作游冶盘,双行洛阳陌。长剑既照曜,高冠何赩赫。 42 | 各有千金裘,俱为五侯客。猛虎落陷阱,壮夫时屈厄。 43 | 相知在急难,独好亦何益。 44 | 45 | 何地早芳菲,宛在长门殿。夭桃色若绶,秾李光如练。 46 | 啼鸟弄花疏,游蜂饮香遍。叹息春风起,飘零君不见。 47 | 48 | 芳树本多奇,年华复在斯。结翠成新幄,开红满旧枝。 49 | 风归花历乱,日度影参差。容色朝朝落,思君君不知。 50 | 51 | 玉花珍簟上,金缕画屏开。晓月怜筝柱,春风忆镜台。 52 | 
筝柱春风吹晓月,芳树落花朝暝歇。稿砧刀头未有时, 53 | 攀条拭泪坐相思。 54 | 55 | 迢迢芳园树,列映清池曲。对此伤人心,还如故时绿。 56 | 风条洒馀霭,露叶承新旭。佳人不再攀,下有往来躅。 57 | 58 | 芳树已寥落,孤英尤可嘉。可怜团团叶,盖覆深深花。 59 | 游蜂竞攒刺,斗雀亦纷拏。天生细碎物,不爱好光华。 60 | 非无歼殄法,念尔有生涯。春雷一声发,惊燕亦惊蛇。 61 | 清池养神蔡,已复长虾蟆。雨露贵平施,吾其春草芽。 62 | 63 | 细蕊慢逐风,暖香闲破鼻。青帝固有心,时时动人意。 64 | 去年高枝犹压地,今年低枝已憔悴。 65 | 吾所以见造化之权,变通之理。春夏作头,秋冬为尾。 66 | 循环反复无穷已。今生长短同一轨,若使威可以制, 67 | 力可以止,秦皇不肯敛手下沙丘,孟贲不合低头入蒿里。 68 | 伊人强猛犹如此,顾我劳生何足恃。但愿开素袍, 69 | 倾绿蚁,陶陶兀兀大醉于青冥白昼间。任他上是天, 70 | 下是地。 71 | 72 | 君子事行役,再空芳岁期。美人旷延伫,万里浮云思。 73 | 园槿绽红艳,郊桑柔绿滋。坐看长夏晚,秋月生罗帏。 74 | 75 | 我思仙人,乃在碧海之东隅。 76 | 海寒多天风,白波连山倒蓬壶。长鲸喷涌不可涉, 77 | 抚心茫茫泪如珠。西来青鸟东飞去,愿寄一书谢麻姑。 78 | 79 | 桔槔烽火昼不灭,客路迢迢信难越。古镇刀攒万片霜, 80 | 寒江浪起千堆雪。此时西去定如何,空使南心远凄切。 81 | 82 | 当时我醉美人家,美人颜色娇如花。今日美人弃我去, 83 | 青楼珠箔天之涯。天涯娟娟常娥月,三五二八盈又缺。 84 | 翠眉蝉鬓生别离,一望不见心断绝。心断绝,几千里, 85 | 梦中醉卧巫山云,觉来泪滴湘江水。湘江两岸花木深, 86 | 美人不见愁人心。含愁更奏绿绮琴,调高弦绝无知音。 87 | 美人兮美人,不知为暮雨兮为朝云,相思一夜梅花发, 88 | 忽到窗前疑是君。 89 | 90 | 借问江上柳,青青为谁春。空游昨日地,不见昨日人。 91 | 缭绕万家井,往来车马尘。莫道无相识,要非心所亲。 92 | 93 | 朝亦有所思,暮亦有所思。登楼望君处,蔼蔼浮云飞。 94 | 浮云遮却阳关道,向晚谁知妾怀抱。玉井苍苔春院深, 95 | 桐花落地无人扫。 96 | 97 | 辟邪伎作鼓吹惊,雉子班之奏曲成,喔咿振迅欲飞鸣。 98 | 扇锦翼,雄风生,双雌同饮啄,趫悍谁能争。 99 | 乍向草中耿介死,不求黄金笼下生。 100 | 所贵旷士怀,朗然合太清。 101 | 102 | 高台暂俯临,飞翼耸轻音。浮光随日度,漾影逐波深。 103 | 迥瞰周平野,开怀畅远襟。独此三休上,还伤千岁心。 104 | 105 | 临高台,高台迢递绝浮埃,瑶轩绮构何崔嵬, 106 | 鸾歌凤吹清且哀。俯瞰长安道,萋萋御沟草, 107 | 斜对甘泉路,苍苍茂陵树。高台四望同, 108 | 帝乡佳气郁葱葱。紫阁丹楼纷照曜,璧房锦殿相玲珑。 109 | 东弥长乐观,西指未央宫。赤城映朝日,绿树摇春风。 110 | 旗亭百队开新市,甲第千甍分戚里。朱轮翠盖不胜春, 111 | 叠树层楹相对起。复有青楼大道中,绣户文窗雕绮栊。 112 | 锦衣昼不襞,罗帏夕未空。歌屏朝掩翠,妆镜晚窥红。 113 | 为吾安宝髻,蛾眉罢花丛。狭路尘间黯将暮, 114 | 云间月色明如素。鸳鸯池上两两飞,凤凰楼下双双度。 115 | 物色正如此,佳期那不顾。银鞍绣毂盛繁华, 116 | 可怜今夜宿倡家。倡家少妇不须嚬,东园桃李片时春。 117 | 君看旧日高台处,柏梁铜雀生黄尘。 118 | 119 | 凉风吹远念,使我升高台。宁知数片云,不是旧山来。 120 | 故人天一涯,久客殊未回。雁来不得书,空寄声哀哀。 121 | 122 | 穿屋穿墙不知止,争树争巢入营死。林间公子挟弹弓, 123 | 一丸致毙花丛里。小雏黄口未有知,青天不解高高飞。 124 | 虞人设网当要路,白日啾嘲祸万机。 125 | 126 | 朝日敛红烟,垂竿向绿川。人疑天上坐,鱼似镜中悬。 127 | 避楫时惊透,猜钩每误牵。湍危不理辖,潭静欲留船。 128 | 钓玉君徒尚,征金我未贤。为看芳饵下,贪得会无全。 129 | 130 | 汉将承恩西破戎,捷书先奏未央宫。 131 | 天子预开麟阁待,只今谁数贰师功。 132 | 133 | 官军西出过楼兰,营幕傍临月窟寒。 134 | 蒲海晓霜凝剑尾,葱山夜雪扑旌竿。 135 | 136 | 鸣笳擂鼓拥回军,破国平蕃昔未闻。 137 | 大夫鹊印摇边月,天将龙旗掣海云。 138 | 139 | 月落辕门鼓角鸣,千群面缚出蕃城。 140 | 洗兵鱼海云迎阵,秣马龙堆月照营。 141 | 142 | 蕃军遥见汉家营,满谷连山遍哭声。 143 | 万箭千刀一夜杀,平明流血浸空城。 144 | 145 | 暮雨旌旗湿未干,胡尘白草日光寒。 146 | 昨夜将军连晓战,蕃军只见马空鞍。 147 | 148 | 晋阳武,奋义威。炀之渝,德焉归。氓毕屠,绥者谁。 149 | 皇烈烈,专天机。号以仁,扬其旗。日之升,九土晞。 150 | 斥田圻,流洪辉。有其二,翼馀隋。斫枭骜,连熊螭。 151 | 枯以肉,勍者羸。后土荡,玄穹弥。合之育,莽然施。 152 | 惟德辅,庆无期。 153 | 154 | 兽之穷,奔大麓。天厚黄德,狙犷服。 155 | 甲之櫜弓,弭矢箙。皇旅靖,敌逾蹙。 156 | 自亡其徒,匪予戮。屈rH猛,虔栗栗。 157 | 縻以尺组,啖以秩。黎之阳,土茫茫。 158 | 富兵戎,盈仓箱。乏者德,莫能享。驱豺兕,授我疆。 159 | 160 | 战武牢,动河朔。逆之助,图掎角。怒鷇麛,抗乔岳。 161 | 翘萌牙,傲霜雹。王谋内定,申掌握。铺施芟夷, 162 | 二主缚。惮华戎,廓封略。命之瞢,卑以斫。归有德, 163 | 唯先觉。 164 | 165 | 泾水黄,陇野茫。负太白,腾天狼。有鸟鸷立,羽翼张。 166 | 钩喙决前,钜趯傍;怒飞饥啸,翾不可当。老雄死, 167 | 子复良。巢岐饮渭,肆翱翔。顿地纮,提天纲。 168 | 列缺掉帜,招摇耀铓。鬼神来助,梦嘉祥。脑涂原野, 169 | 魂飞扬。星辰复,恢一方。 170 | 171 | 奔鲸沛,荡海垠。吐霓翳日,腥浮云。帝怒下顾, 172 | 哀垫昏。授以神柄,推元臣。手援天矛,截修鳞。 173 | 披攘蒙霿,开海门。地平水静,浮天垠。羲和显耀, 174 | 乘清氛。赫炎溥畅,融大钧。 175 | 176 | 苞枿ba矣,惟根之蟠。弥巴蔽荆,负南极以安。 177 | 曰我旧梁氏,辑绥艰难。江汉之阻,都邑固以完。 178 | 圣人作,神武用,有臣勇智,奋不以众。投迹死地, 179 | 谋猷纵。化敌为家,虑则中。浩浩海裔,不威而同。 180 | 系缧降王,定厥功。澶漫万里,宣唐风。蛮夷九译, 181 | 咸来从。凯旋金奏,象形容。震赫万国,罔不龚。 182 | 183 | 河右澶漫,顽为之魁。王师如雷震,昆仑以颓。 184 | 上聋下聪,骜不可回。助仇抗有德,惟人之灾。 185 | 乃溃乃奋,执缚归厥命。万室蒙其仁,一夫则病。 186 | 濡以鸿泽,皇之圣。威畏德怀,功以定。顺之于理, 187 | 物咸遂厥性。 188 | 189 | 铁山碎,大漠舒。二虏劲,连穹庐。背北海,专坤隅。 190 | 岁来侵边,或傅于都。天子命元帅,奋其雄图。 191 | 破定襄,降魁渠。穷竟窟宅,斥余吾。百蛮破胆, 192 | 边氓苏。威武辉耀,明鬼区。利泽弥万祀,功不可逾。 193 | 官臣拜手,惟帝之谟。 194 | 195 | 本邦伊晋,惟时不靖。根柢之摇,枝叶攸病。 196 | 守臣不任,勩于神圣。惟钺之兴,翦焉则定。 197 | 洪惟我理,式和以敬。群顽既夷,庶绩咸正。 198 | 皇谟载大,惟人之庆。 199 | 200 | 吐谷浑盛强,背西海以夸。岁侵扰我疆,退匿险且遐。 201 | 帝谓神武师,往征靖皇家。烈烈旆其旗,熊虎杂龙蛇。 202 | 王旅千万人,衔枚默无哗。束刃逾山徼,张翼纵漠沙。 203 | 一举刈膻腥,尸骸积如麻。除恶务本根,况敢遗萌芽。 204 | 洋洋西海水,威命穷天涯。系虏来王都,犒乐穷休嘉。 205 | 登高望还师,竟野如春华。行者靡不归,亲戚讙要遮。 206 | 凯旋献清庙,万国思无邪。 207 | 208 | 
麹氏雄西北,别绝臣外区。既恃远且险,纵傲不我虞。 209 | 烈烈王者师,熊螭以为徒。龙旂翻海浪,馹骑驰坤隅。 210 | 贲育搏婴儿,一扫不复馀。平沙际天极,但见黄云驱。 211 | 臣靖执长缨,智勇伏囚拘。文皇南面坐,夷狄千群趋。 212 | 咸称天子神,往古不得俱。献号天可汗,以覆我国都。 213 | 兵戎不交害,各保性与躯。 214 | 215 | 东蛮有谢氏,冠带理海中。已言我异世,虽圣莫能通。 216 | 王卒如飞翰,鹏鶱骇群龙。轰然自天坠,乃信神武功。 217 | 系虏君臣人,累累来自东。无思不服从,唐业如山崇。 218 | 百辟拜稽首,咸愿图形容。如周王会书,永永传无穷。 219 | 睢盱万状乖,咿嗢九译重。广轮抚四海,浩浩知皇风。 220 | 歌诗铙鼓间,以壮我元戎。 221 | 222 | 片玉来夸楚,治中作主人。江山增润色,词赋动阳春。 223 | 别馆当虚敞,离情任吐伸。因声两京旧,谁念卧漳滨。 224 | 225 | 楚关望秦国,相去千里馀。州县勤王事,山河转使车。 226 | 祖筵江上列,离恨别前书。愿及芳年赏,娇莺二月初。 227 | 228 | 早闻牛渚咏,今见鶺鴒心。羽翼嗟零落,悲鸣别故林。 229 | 苍梧白云远,烟水洞庭深。万里独飞去,南风迟尔音。 230 | 231 | 旧国余归楚,新年子北征。挂帆愁海路,分手恋朋情。 232 | 日夕故园意,汀洲春草生。何时一杯酒,重与季鹰倾。 233 | 234 | 吾道昧所适,驱车还向东。主人开旧馆,留客醉新丰。 235 | 树绕温泉绿,尘遮晚日红。拂衣从此去,高步蹑华嵩。 236 | 237 | 何幸遇休明,观光来上京。相逢武陵客,独送豫章行。 238 | 随牒牵黄绶,离群会墨卿。江南佳丽地,山水旧难名。 239 | 240 | 南国辛居士,言归旧竹林。未逢调鼎用,徒有济川心。 241 | 予亦忘机者,田园在汉阴。因君故乡去,遥寄式微吟。 242 | 243 | 惜尔怀其宝,迷邦倦客游。江山历全楚,河洛越成周。 244 | 道路疲千里,乡园老一丘。知君命不偶,同病亦同忧。 245 | 246 | 奉使推能者,勤王不暂闲。观风随按察,乘骑度荆关。 247 | 送别登何处,开筵旧岘山。征轩明日远,空望郢门间。 248 | 249 | 导漾自嶓冢,东流为汉川。维桑君有意,解缆我开筵。 250 | 云雨从兹别,林端意渺然。尺书能不吝,时望鲤鱼传。 251 | 252 | 西上游江西,临流恨解携。千山叠成嶂,万水泻为溪。 253 | 石浅流难溯,藤长险易跻。谁怜问津者,岁晏此中迷。 254 | 255 | 士有不得志,栖栖吴楚间。广陵相遇罢,彭蠡泛舟还。 256 | 樯出江中树,波连海上山。风帆明日远,何处更追攀。 257 | 258 | 献策金门去,承欢彩服违。以吾一日长,念尔聚星稀。 259 | 昏定须温席,寒多未授衣。桂枝如已擢,早逐雁南飞。 260 | 261 | 白日既云暮,朱颜亦已酡。画堂初点烛,金幌半垂罗。 262 | 长袖平阳曲,新声子夜歌。从来惯留客,兹夕为谁多。 263 | 264 | 侧听弦歌宰,文书游夏徒。故园欣赏竹,为邑幸来苏。 265 | 华省曾联事,仙舟复与俱。欲知临泛久,荷露渐成珠。 266 | 267 | 邑有弦歌宰,翔鸾狎野鸥。眷言华省旧,暂拂海池游。 268 | 郁岛藏深竹,前溪对舞楼。更闻书即事,云物是清秋。 269 | 270 | 甲第开金穴,荣期乐自多。枥嘶支遁马,池养右军鹅。 271 | 竹引携琴入,花邀载酒过。山公来取醉,时唱接z5歌。 272 | 273 | 言避一时暑,池亭五月开。喜逢金马客,同饮玉人杯。 274 | 舞鹤乘轩至,游鱼拥钓来。座中殊未起,箫管莫相催。 275 | 276 | 林卧愁春尽,开轩览物华。忽逢青鸟使,邀入赤松家。 277 | 丹灶初开火,仙桃正落花。童颜若可驻,何惜醉流霞。 278 | 279 | 瑞雪初盈尺,寒宵始半更。列筵邀酒伴,刻烛限诗成。 280 | 香炭金炉暖,娇弦玉指清。醉来方欲卧,不觉晓鸡鸣。 281 | 282 | 世业传珪组,江城佐股肱。高斋征学问,虚薄滥先登。 283 | 讲论陪诸子,文章得旧朋。士元多赏激,衰病恨无能。 284 | 285 | 人事有代谢,往来成古今。江山留胜迹,我辈复登临。 286 | 水落鱼梁浅,天寒梦泽深。羊公碑字在,读罢泪沾襟。 287 | 288 | 水楼一登眺,半出青林高。帟幕英僚敞,芳筵下客叨。 289 | 山藏伯禹穴,城压伍胥涛。今日观溟涨,垂纶学钓鳌。 290 | 291 | 吾友太乙子,餐霞卧赤城。欲寻华顶去,不惮恶溪名。 292 | 歇马凭云宿,扬帆截海行。高高翠微里,遥见石梁横。 293 | 294 | 秋入诗人意,巴歌和者稀。泛湖同逸旅,吟会是思归。 295 | 白简徒推荐,沧洲已拂衣。杳冥云外去,谁不羡鸿飞。 296 | 297 | 挂席几千里,名山都未逢。泊舟浔阳郭,始见香炉峰。 298 | 尝读远公传,永怀尘外踪。东林精舍近,日暮但闻钟。 299 | 300 | 独步人何在,嵩阳有故楼。岁寒问耆旧,行县拥诸侯。 301 | 林莽北弥望,沮漳东会流。客中遇知己,无复越乡忧。 302 | 303 | 武陵川路狭,前棹入花林。莫测幽源里,仙家信几深。 304 | 水回青嶂合,云度绿溪阴。坐听闲猿啸,弥清尘外心。 305 | 306 | 百里闻雷震,鸣弦暂辍弹。府中连骑出,江上待潮观。 307 | 照日秋云迥,浮天渤澥宽。惊涛来似雪,一坐凛生寒。 308 | 309 | 主人新邸第,相国旧池台。馆是招贤辟,楼因教舞开。 310 | 轩车人已散,箫管凤初来。今日龙门下,谁知文举才。 311 | 312 | 水亭凉气多,闲棹晚来过。涧影见松竹,潭香闻芰荷。 313 | 野童扶醉舞,山鸟助酣歌。幽赏未云遍,烟光奈夕何。 314 | 315 | 故人来自远,邑宰复初临。执手恨为别,同舟无异心。 316 | 沿洄洲渚趣,演漾弦歌音。谁识躬耕者,年年梁甫吟。 317 | 318 | 共喜年华好,来游水石间。烟容开远树,春色满幽山。 319 | 壶酒朋情洽,琴歌野兴闲。莫愁归路暝,招月伴人还。 320 | 321 | 南纪西江阔,皇华御史雄。截流宁假楫,挂席自生风。 322 | 僚寀争攀鹢,鱼龙亦避骢。坐听白雪唱,翻入棹歌中。 323 | 324 | 万山青嶂曲,千骑使君游。神女鸣环佩,仙郎接献酬。 325 | 遍观云梦野,自爱江城楼。何必东南守,空传沈隐侯。 326 | 327 | 海亭秋日望,委曲见江山。染翰聊题壁,倾壶一解颜。 328 | 歌逢彭泽令,归赏故园间。予亦将琴史,栖迟共取闲。 329 | 330 | 河县柳林边,河桥晚泊船。文叨才子会,官喜故人连。 331 | 笑语同今夕,轻肥异往年。晨风理归棹,吴楚各依然。 332 | 333 | 傲吏非凡吏,名流即道流。隐居不可见,高论莫能酬。 334 | 水接仙源近,山藏鬼谷幽。再来迷处所,花下问渔舟。 335 | 336 | 龙象经行处,山腰度石关。屡迷青嶂合,时爱绿萝闲。 337 | 宴息花林下,高谈竹屿间。寥寥隔尘事,疑是入鸡山。 338 | 339 | 欣逢柏台友,共谒聪公禅。石室无人到,绳床见虎眠。 340 | 阴崖常抱雪,枯涧为生泉。出处虽云异,同欢在法筵。 341 | 342 | 出谷未停午,到家日已曛。回瞻下山路,但见牛羊群。 343 | 樵子暗相失,草虫寒不闻。衡门犹未掩,伫立望夫君。 344 | 345 | 夏日茅斋里,无风坐亦凉。竹林深笋穊,藤架引梢长。 346 | 燕觅巢窠处,蜂来造蜜房。物华皆可玩,花蕊四时芳。 347 | 348 | 释子弥天秀,将军武库才。横行塞北尽,独步汉南来。 349 | 贝叶传金口,山楼作赋开。因君振嘉藻,江楚气雄哉。 350 | 351 | 误入桃源里,初怜竹径深。方知仙子宅,未有世人寻。 352 | 舞鹤过闲砌,飞猿啸密林。渐通玄妙理,深得坐忘心。 353 | 354 | 支遁初求道,深公笑买山。何如石岩趣,自入户庭间。 355 | 苔涧春泉满,萝轩夜月闲。能令许玄度,吟卧不知还。 356 | 357 | 人事一朝尽,荒芜三径休。始闻漳浦卧,奄作岱宗游。 358 | 池水犹含墨,风云已落秋。今宵泉壑里,何处觅藏舟。 359 | 360 | 彭泽先生柳,山阴道士鹅。我来从所好,停策汉阴多。 361 | 重以观鱼乐,因之鼓枻歌。崔徐迹未朽,千载揖清波。 362 | 363 
| 带雪梅初暖,含烟柳尚青。来窥童子偈,得听法王经。 364 | 会理知无我,观空厌有形。迷心应觉悟,客思未遑宁。 365 | 366 | 给园支遁隐,虚寂养身和。春晚群木秀,间关黄鸟歌。 367 | 林栖居士竹,池养右军鹅。炎月北窗下,清风期再过。 368 | 369 | 义公习禅处,结构依空林。户外一峰秀,阶前群壑深。 370 | 夕阳连雨足,空翠落庭阴。看取莲花净,应知不染心。 371 | 372 | 白鹤青岩半,幽人有隐居。阶庭空水石,林壑罢樵渔。 373 | 岁月青松老,风霜苦竹疏。睹兹怀旧业,回策返吾庐。 374 | 375 | 精舍买金开,流泉绕砌回。芰荷薰讲席,松柏映香台。 376 | 法雨晴飞去,天花昼下来。谈玄殊未已,归骑夕阳催。 377 | 378 | 池上青莲宇,林间白马泉。故人成异物,过客独潸然。 379 | 既礼新松塔,还寻旧石筵。平生竹如意,犹挂草堂前。 380 | 381 | 与君园庐并,微尚颇亦同。耕钓方自逸,壶觞趣不空。 382 | 门无俗士驾,人有上皇风。何处先贤传,惟称庞德公。 383 | 384 | 弱岁早登龙,今来喜再逢。如何春月柳,犹忆岁寒松。 385 | 烟火临寒食,笙歌达曙钟。喧喧斗鸡道,行乐羡朋从。 386 | 387 | 闻就庞公隐,移居近洞湖。兴来林是竹,归卧谷名愚。 388 | 挂席樵风便,开轩琴月孤。岁寒何用赏,霜落故园芜。 389 | 390 | 府僚能枉驾,家酝复新开。落日池上酌,清风松下来。 391 | 厨人具鸡黍,稚子摘杨梅。谁道山公醉,犹能骑马回。 392 | 393 | 折戟沈沙铁未销,自将磨洗认前朝。 394 | 东风不与周郎便,铜雀春深锁二乔。 395 | 396 | 日旗龙旆想飘扬,一索功高缚楚王。 397 | 直是超然五湖客,未如终始郭汾阳。 398 | 399 | 贱子来千里,明公去一麾。可能休涕泪,岂独感恩知。 400 | 草木穷秋后,山川落照时。如何望故国,驱马却迟迟。 401 | 402 | 一笑五云溪上舟,跳丸日月十经秋。鬓衰酒减欲谁泥, 403 | 迹辱魂惭好自尤。梦寐几回迷蛱蝶,文章应解伴牢愁。 404 | 无穷尘土无聊事,不得清言解不休。 405 | 406 | 烟笼寒水月笼沙,夜泊秦淮近酒家。 407 | 商女不知亡国恨,隔江犹唱后庭花。 408 | 409 | 萧萧山路穷秋雨,淅淅溪风一岸蒲。 410 | 为问寒沙新到雁,来时还下杜陵无。 411 | 412 | 细腰宫里露桃新,脉脉无言度几春。 413 | 至竟息亡缘底事,可怜金谷坠楼人。 414 | 415 | 雪涨前溪水,啼声已绕滩。梅衰未减态,春嫩不禁寒。 416 | 迹去梦一觉,年来事百般。闻君亦多感,何处倚阑干。 417 | 418 | 平生自许少尘埃,为吏尘中势自回。朱绂久惭官借与, 419 | 白题还叹老将来。须知世路难轻进,岂是君门不大开。 420 | 霄汉几多同学伴,可怜头角尽卿材。 421 | 422 | 缄书报子玉,为我谢平津。自愧扫门士,谁为乞火人。 423 | 词臣陪羽猎,战将骋骐驎。两地差池恨,江汀醉送君。 424 | 425 | 芳草渡头微雨时,万株杨柳拂波垂。蒲根水暖雁初浴, 426 | 梅径香寒蜂未知。辞客倚风吟暗淡,使君回马湿旌旗。 427 | 江南仲蔚多情调,怅望春阴几首诗。 428 | 429 | 江湖醉渡十年春,牛渚山边六问津。 430 | 历阳前事知何实,高位纷纷见陷人。 431 | 432 | 胜败兵家事不期,包羞忍耻是男儿。 433 | 434 | 孙家兄弟晋龙骧,驰骋功名业帝王。 435 | 至竟江山谁是主,苔矶空属钓鱼郎。 436 | 437 | 发匀肉好生春岭,截玉钻星寄使君。檀的染时痕半月, 438 | 落梅飘处响穿云。楼中威凤倾冠听,沙上惊鸿掠水分。 439 | 遥想紫泥封诏罢,夜深应隔禁墙闻。 440 | 441 | 青山隐隐水迢迢,秋尽江南草木凋。 442 | 二十四桥明月夜,玉人何处教吹箫。 443 | 444 | 故人别来面如雪,一榻拂云秋影中。 445 | 玉白花红三百首,五陵谁唱与春风。 446 | 447 | 贾傅松醪酒,秋来美更香。怜君片云思,一去绕潇湘。 448 | 449 | 一渠东注芳华苑,苑锁池塘百岁空。 450 | 水殿半倾蟾口涩,为谁流下蓼花中。 451 | 452 | 锦缆龙舟隋炀帝,平台复道汉梁王。 453 | 游人闲起前朝念,折柳孤吟断杀肠。 454 | 455 | 千里长河初冻时,玉珂瑶珮响参差。 456 | 浮生却似冰底水,日夜东流人不知。 457 | 458 | 七子论诗谁似公,曹刘须在指挥中。荐衡昔日知文举, 459 | 乞火无人作蒯通。北极楼台长挂梦,西江波浪远吞空。 460 | 可怜故国三千里,虚唱歌词满六宫。 461 | 462 | 大夫官重醉江东,潇洒名儒振古风。文石陛前辞圣主。 463 | 碧云天外作冥鸿。五言宁谢颜光禄,百岁须齐卫武公。 464 | 再拜宜同丈人行,过庭交分有无同。 465 | 466 | 水接西江天外声,小斋松影拂云平。 467 | 何人教我吹长笛,与倚春风弄月明。 468 | 469 | 广文遗韵留樗散,鸡犬图书共一船。 470 | 自说江湖不归事,阻风中酒过年年。 471 | 472 | 三吴裂婺女,九锡狱孤儿。霸主业未半,本朝心是谁。 473 | 永安宫受诏,筹笔驿沉思。画地乾坤在,濡毫胜负知。 474 | 艰难同草创,得失计毫厘。寂默经千虑,分明浑一期。 475 | 川流萦智思,山耸助扶持。慷慨匡时略,从容问罪师。 476 | 褒中秋鼓角,渭曲晚旌旗。仗义悬无敌,鸣攻故有辞。 477 | 若非天夺去,岂复虑能支。子夜星才落,鸿毛鼎便移。 478 | 邮亭世自换,白日事长垂。何处躬耕者,犹题殄瘁诗。 479 | 480 | 邮亭寄人世,人世寄邮亭。何如自筹度,鸿路有冥冥。 481 | 482 | 少微星动照春云,魏阙衡门路自分。 483 | 倏去忽来应有意,世间尘土谩疑君。 484 | 485 | 调高银字声还侧,物比柯亭韵校奇。 486 | 寄与玉人天上去,桓将军见不教吹。 487 | 488 | 历阳崔太守,何日不含情。恩义同钟李,埙篪实弟兄。 489 | 光尘能混合,擘画最分明。台阁仁贤誉,闺门孝友声。 490 | 西方像教毁,南海绣衣行。金橐宁回顾,珠簟肯一枨。 491 | 只宜裁密诏,何自取专城。进退无非道,徊翔必有名。 492 | 好风初婉软,离思苦萦盈。金马旧游贵,桐庐春水生。 493 | 雨侵寒牖梦,梅引冻醪倾。共祝中兴主,高歌唱太平 494 | 纨袴不饿死,儒冠多误身。丈人试静听,贱子请具陈。 495 | 甫昔少年日,早充观国宾。读书破万卷,下笔如有神。 496 | 赋料扬雄敌,诗看子建亲。李邕求识面,王翰愿卜邻。 497 | 自谓颇挺出,立登要路津。致君尧舜上,再使风俗淳。 498 | 此意竟萧条,行歌非隐沦。骑驴三十载,旅食京华春。 499 | 朝扣富儿门,暮随肥马尘。残杯与冷炙,到处潜悲辛。 500 | 主上顷见征,欻然欲求伸。青冥却垂翅,蹭蹬无纵鳞。 501 | 甚愧丈人厚,甚知丈人真。每于百僚上,猥诵佳句新。 502 | 窃效贡公喜,难甘原宪贫。焉能心怏怏,只是走踆踆。 503 | 今欲东入海,即将西去秦。尚怜终南山,回首清渭滨。 504 | 常拟报一饭,况怀辞大臣。白鸥没浩荡,万里谁能驯。 505 | 506 | 崆峒小麦熟,且愿休王师。请公问主将,焉用穷荒为。 507 | 饥鹰未饱肉,侧翅随人飞。高生跨鞍马,有似幽并儿。 508 | 脱身簿尉中,始与捶楚辞。借问今何官,触热向武威。 509 | 答云一书记,所愧国士知。人实不易知,更须慎其仪。 510 | 十年出幕府,自可持旌麾。此行既特达,足以慰所思。 511 | 男儿功名遂,亦在老大时。常恨结欢浅,各在天一涯。 512 | 又如参与商,惨惨中肠悲。惊风吹鸿鹄,不得相追随。 513 | 黄尘翳沙漠,念子何当归。边城有馀力,早寄从军诗。 514 | 二年客东都,所历厌机巧。野人对膻腥,蔬食常不饱。 515 | 岂无青精饭,使我颜色好。苦乏大药资,山林迹如扫。 516 | 李侯金闺彦,脱身事幽讨。亦有梁宋游,方期拾瑶草。 517 | 已从招提游,更宿招提境。阴壑生虚籁,月林散清影。 518 | 天阙象纬逼,云卧衣裳冷。欲觉闻晨钟,令人发深省。 519 | 520 | 
岱宗夫如何,齐鲁青未了。造化钟神秀,阴阳割昏晓。 521 | 荡胸生曾云,决眦入归鸟。会当凌绝顶,一览众山小。 522 | 东藩驻皂盖,北渚凌青荷。海内此亭古,济南名士多。 523 | 云山已发兴,玉佩仍当歌。修竹不受暑,交流空涌波。 524 | 蕴真惬所遇,落日将如何。贵贱俱物役,从公难重过。 525 | 新亭结构罢,隐见清湖阴。迹籍台观旧,气溟海岳深。 526 | 圆荷想自昔,遗堞感至今。芳宴此时具,哀丝千古心。 527 | 主称寿尊客,筵秩宴北林。不阻蓬荜兴,得兼梁甫吟。 528 | 故人昔隐东蒙峰,已佩含景苍精龙。故人今居子午谷, 529 | 独在阴崖结茅屋。屋前太古玄都坛,青石漠漠常风寒。 530 | 子规夜啼山竹裂,王母昼下云旗翻。知君此计成长往, 531 | 芝草琅玕日应长。铁锁高垂不可攀,致身福地何萧爽。 532 | 533 | 今夕何夕岁云徂,更长烛明不可孤。咸阳客舍一事无, 534 | 相与博塞为欢娱。冯陵大叫呼五白,袒跣不肯成枭卢。 535 | 英雄有时亦如此,邂逅岂即非良图。 536 | 君莫笑刘毅从来布衣愿,家无儋石输百万。 537 | 538 | 翻手作云覆手雨,纷纷轻薄何须数。君不见管鲍贫时交, 539 | 此道今人弃如土。 540 | 541 | 车辚辚,马萧萧,行人弓箭各在腰。耶娘妻子走相送, 542 | 尘埃不见咸阳桥。牵衣顿足阑道哭,哭声直上干云霄。 543 | 道傍过者问行人,行人但云点行频。或从十五北防河, 544 | 便至四十西营田。去时里正与裹头,归来头白还戍边。 545 | 边亭流血成海水,武皇开边意未已。 546 | 君不闻汉家山东二百州,千村万落生荆杞。 547 | 纵有健妇把锄犁,禾生陇亩无东西。况复秦兵耐苦战, 548 | 被驱不异犬与鸡。长者虽有问,役夫敢申恨。 549 | 且如今年冬,未休关西卒。县官急索租,租税从何出。 550 | 信知生男恶,反是生女好。生女犹是嫁比邻, 551 | 生男埋没随百草。君不见青海头,古来白骨无人收。 552 | 新鬼烦冤旧鬼哭,天阴雨湿声啾啾。 553 | 554 | 安西都护胡青骢,声价欻然来向东。此马临阵久无敌, 555 | 与人一心成大功。功成惠养随所致,飘飘远自流沙至。 556 | 雄姿未受伏枥恩,猛气犹思战场利。腕促蹄高如踣铁, 557 | 交河几蹴曾冰裂。五花散作云满身,万里方看汗流血。 558 | 长安壮儿不敢骑,走过掣电倾城知。青丝络头为君老, 559 | 何由却出横门道。 560 | 561 | 吾闻天子之马走千里,今之画图无乃是。 562 | 是何意态雄且杰,骏尾萧梢朔风起。毛为绿缥两耳黄, 563 | 眼有紫焰双瞳方。矫矫龙性合变化,卓立天骨森开张。 564 | 伊昔太仆张景顺,监牧攻驹阅清峻。遂令大奴守天育, 565 | 别养骥子怜神俊。当时四十万匹马,张公叹其材尽下。 566 | 故独写真传世人,见之座右久更新。年多物化空形影, 567 | 呜呼健步无由骋。如今岂无騕褭与骅骝, 568 | 时无王良伯乐死即休。 569 | 570 | 缫丝须长不须白,越罗蜀锦金粟尺。象床玉手乱殷红, 571 | 万草千花动凝碧。已悲素质随时染,裂下鸣机色相射。 572 | 美人细意熨帖平,裁缝灭尽针线迹。春天衣著为君舞, 573 | 蛱蝶飞来黄鹂语。落絮游丝亦有情,随风照日宜轻举。 574 | 香汗轻尘污颜色,开新合故置何许。君不见才士汲引难, 575 | 恐惧弃捐忍羁旅。 576 | 577 | 雨中百草秋烂死,阶下决明颜色鲜。著叶满枝翠羽盖, 578 | 开花无数黄金钱。凉风萧萧吹汝急,恐汝后时难独立。 579 | 堂上书生空白头,临风三嗅馨香泣。 580 | 581 | 阑风长雨秋纷纷,四海八荒同一云。去马来牛不复辨, 582 | 浊泾清渭何当分。禾头生耳黍穗黑,农夫田妇无消息。 583 | 城中斗米换衾裯,相许宁论两相直。 584 | 长安布衣谁比数,反锁衡门守环堵。老夫不出长蓬蒿, 585 | 稚子无忧走风雨。雨声飕飕催早寒,胡雁翅湿高飞难。 586 | 秋来未曾见白日,泥污后土何时干。 587 | 588 | 檐前甘菊移时晚,青蕊重阳不堪摘。明日萧条醉尽醒, 589 | 残花烂熳开何益。篱边野外多众芳,采撷细琐升中堂。 590 | 念兹空长大枝叶,结根失所缠风霜。 591 | 592 | 诸公衮衮登台省,广文先生官独冷。甲第纷纷厌粱肉, 593 | 广文先生饭不足。先生有道出羲皇,先生有才过屈宋。 594 | 德尊一代常轗轲,名垂万古知何用。杜陵野客人更嗤, 595 | 被褐短窄鬓如丝。日籴太仓五升米,时赴郑老同襟期。 596 | 得钱即相觅,沽酒不复疑。忘形到尔汝,痛饮真吾师。 597 | 清夜沈沈动春酌,灯前细雨檐花落。但觉高歌有鬼神, 598 | 焉知饿死填沟壑。相如逸才亲涤器,子云识字终投阁。 599 | 先生早赋归去来,石田茅屋荒苍苔。儒术于我何有哉, 600 | 孔丘盗跖俱尘埃。不须闻此意惨怆,生前相遇且衔杯。 601 | 602 | 陆机二十作文赋,汝更小年能缀文。总角草书又神速, 603 | 世上儿子徒纷纷。骅骝作驹已汗血,鸷鸟举翮连青云。 604 | 词源倒流三峡水,笔阵独扫千人军。只今年才十六七, 605 | 射策君门期第一。旧穿杨叶真自知,暂蹶霜蹄未为失。 606 | 偶然擢秀非难取,会是排风有毛质。汝身已见唾成珠, 607 | 汝伯何由发如漆。春光澹沱秦东亭,渚蒲牙白水荇青。 608 | 风吹客衣日杲杲,树搅离思花冥冥。酒尽沙头双玉瓶, 609 | 众宾皆醉我独醒。乃知贫贱别更苦,吞声踯躅涕泪零。 610 | 人生不相见,动如参与商。今夕复何夕,共此灯烛光。 611 | 少壮能几时,鬓发各已苍。访旧半为鬼,惊呼热中肠。 612 | 焉知二十载,重上君子堂。昔别君未婚,儿女忽成行。 613 | 怡然敬父执,问我来何方。问答乃未已,儿女罗酒浆。 614 | 夜雨翦春韭,新炊间黄粱。主称会面难,一举累十觞。 615 | 十觞亦不醉,感子故意长。明日隔山岳,世事两茫茫。 616 | 617 | 今秋乃淫雨,仲月来寒风。群木水光下,万象云气中。 618 | 所思碍行潦,九里信不通。悄悄素浐路,迢迢天汉东。 619 | 愿腾六尺马,背若孤征鸿。划见公子面,超然欢笑同。 620 | 奋飞既胡越,局促伤樊笼。一饭四五起,凭轩心力穷。 621 | 嘉蔬没混浊,时菊碎榛丛。鹰隼亦屈猛,乌鸢何所蒙。 622 | 式瞻北邻居,取适南巷翁。挂席钓川涨,焉知清兴终。 623 | 高标跨苍天,烈风无时休。自非旷士怀,登兹翻百忧。 624 | 方知象教力,足可追冥搜。仰穿龙蛇窟,始出枝撑幽。 625 | 七星在北户,河汉声西流。羲和鞭白日,少昊行清秋。 626 | 秦山忽破碎,泾渭不可求。俯视但一气,焉能辨皇州。 627 | 回首叫虞舜,苍梧云正愁。惜哉瑶池饮,日晏昆仑丘。 628 | 黄鹄去不息,哀鸣何所投。君看随阳雁,各有稻粱谋。 629 | 630 | 平明跨驴出,未知适谁门。权门多噂eR,且复寻诸孙。 631 | 诸孙贫无事,宅舍如荒村。堂前自生竹,堂后自生萱。 632 | 萱草秋已死,竹枝霜不蕃。淘米少汲水,汲多井水浑。 633 | 刈葵莫放手,放手伤葵根。阿翁懒惰久,觉儿行步奔。 634 | 所来为宗族,亦不为盘飧。小人利口实,薄俗难可论。 635 | 勿受外嫌猜,同姓古所敦。 636 | 637 | 出门复入门,两脚但如旧。所向泥活活,思君令人瘦。 638 | 沉吟坐西轩,饮食错昏昼。寸步曲江头,难为一相就。 639 | 吁嗟呼苍生,稼穑不可救。安得诛云师,畴能补天漏。 640 | 大明韬日月,旷野号禽兽。君子强逶迤,小人困驰骤。 641 | 维南有崇山,恐与川浸溜。是节东篱菊,纷披为谁秀。 642 | 岑生多新诗,性亦嗜醇酎。采采黄金花,何由满衣袖。 643 | 644 | 巢父掉头不肯住,东将入海随烟雾。诗卷长留天地间, 645 | 钓竿欲拂珊瑚树。深山大泽龙蛇远,春寒野阴风景暮。 646 | 蓬莱织女回云车,指点虚无是征路。自是君身有仙骨, 647 | 世人那得知其故。惜君只欲苦死留,富贵何如草头露。 648 | 蔡侯静者意有馀,清夜置酒临前除。罢琴惆怅月照席, 649 | 几岁寄我空中书。南寻禹穴见李白,道甫问信今何如。 650 | 知章骑马似乘船,眼花落井水底眠。汝阳三斗始朝天, 651 | 道逢麹车口流涎,恨不移封向酒泉。左相日兴费万钱, 
652 | 饮如长鲸吸百川,衔杯乐圣称世贤。宗之潇洒美少年, 653 | 举觞白眼望青天,皎如玉树临风前。苏晋长斋绣佛前, 654 | 醉中往往爱逃禅。李白一斗诗百篇,长安市上酒家眠。 655 | 天子呼来不上船,自称臣是酒中仙。张旭三杯草圣传, 656 | 脱帽露顶王公前,挥毫落纸如云烟。焦遂五斗方卓然, 657 | 高谈雄辨惊四筵。 658 | 659 | 曲江萧条秋气高,菱荷枯折随风涛,游子空嗟垂二毛。 660 | 白石素沙亦相荡,哀鸿独叫求其曹。 661 | 即事非今亦非古,长歌激越梢林莽,比屋豪华固难数。 662 | 吾人甘作心似灰,弟侄何伤泪如雨。 663 | 664 | 自断此生休问天,杜曲幸有桑麻田,故将移住南山边。 665 | 短衣匹马随李广,看射猛虎终残年。 666 | 三月三日天气新,长安水边多丽人。态浓意远淑且真, 667 | 肌理细腻骨肉匀。绣罗衣裳照暮春,蹙金孔雀银麒麟。 668 | 头上何所有,翠微zc叶垂鬓唇。背后何所见, 669 | 珠压腰衱稳称身。就中云幕椒房亲,赐名大国虢与秦。 670 | 紫驼之峰出翠釜,水精之盘行素鳞。犀箸厌饫久未下, 671 | 銮刀缕切空纷纶。黄门飞鞚不动尘,御厨络绎送八珍。 672 | 箫鼓哀吟感鬼神,宾从杂遝实要津。后来鞍马何逡巡, 673 | 当轩下马入锦茵。杨花雪落覆白蘋,青鸟飞去衔红巾。 674 | 炙手可热势绝伦,慎莫近前丞相嗔。 675 | 676 | 乐游古园崒森爽,烟绵碧草萋萋长。公子华筵势最高, 677 | 秦川对酒平如掌。长生木瓢示真率,更调鞍马狂欢赏。 678 | 青春波浪芙蓉园,白日雷霆夹城仗。阊阖晴开昳荡荡, 679 | 曲江翠幕排银榜。拂水低徊舞袖翻,缘云清切歌声上。 680 | 却忆年年人醉时,只今未醉已先悲。数茎白发那抛得, 681 | 百罚深杯亦不辞。圣朝亦知贱士丑,一物自荷皇天慈。 682 | 此身饮罢无归处,独立苍茫自咏诗。 683 | 684 | 岑参兄弟皆好奇,携我远来游渼陂。天地黤惨忽异色, 685 | 波涛万顷堆琉璃。琉璃汗漫泛舟入,事殊兴极忧思集。 686 | 鼍作鲸吞不复知,恶风白浪何嗟及。主人锦帆相为开, 687 | 舟子喜甚无氛埃。凫鹥散乱棹讴发,丝管啁啾空翠来。 688 | 沈竿续蔓深莫测,菱叶荷花静如拭。宛在中流渤澥清, 689 | 下归无极终南黑。半陂已南纯浸山,动影袅窕冲融间。 690 | 船舷暝戛云际寺,水面月出蓝田关。此时骊龙亦吐珠, 691 | 冯夷击鼓群龙趋。湘妃汉女出歌舞,金支翠旗光有无。 692 | 咫尺但愁雷雨至,苍茫不晓神灵意。少壮几时奈老何, 693 | 向来哀乐何其多。 694 | 695 | 高台面苍陂,六月风日冷。蒹葭离披去,天水相与永。 696 | 怀新目似击,接要心已领。仿像识鲛人,空蒙辨鱼艇。 697 | 错磨终南翠,颠倒白阁影。崷崒增光辉,乘陵惜俄顷。 698 | 劳生愧严郑,外物慕张邴。世复轻骅骝,吾甘杂蛙黾。 699 | 知归俗可忽,取适事莫并。身退岂待官,老来苦便静。 700 | 况资菱芡足,庶结茅茨迥。从此具扁舟,弥年逐清景。 701 | 广文到官舍,系马堂阶下。醉则骑马归,颇遭官长骂。 702 | 才名四十年,坐客寒无毡。赖有苏司业,时时与酒钱。 703 | 704 | 远林暑气薄,公子过我游。贫居类村坞,僻近城南楼。 705 | 旁舍颇淳朴,所愿亦易求。隔屋唤西家,借问有酒不。 706 | 墙头过浊醪,展席俯长流。清风左右至,客意已惊秋。 707 | 巢多众鸟斗,叶密鸣蝉稠。苦道此物聒,孰谓吾庐幽。 708 | 水花晚色静,庶足充淹留。预恐尊中尽,更起为君谋。 709 | 710 | 东山气鸿濛,宫殿居上头。君来必十月,树羽临九州。 711 | 阴火煮玉泉,喷薄涨岩幽。有时浴赤日,光抱空中楼。 712 | 阆风入辙迹,旷原延冥搜。沸天万乘动,观水百丈湫。 713 | 幽灵斯可佳,王命官属休。初闻龙用壮,擘石摧林丘。 714 | 中夜窟宅改,移因风雨秋。倒悬瑶池影,屈注苍江流。 715 | 味如甘露浆,挥弄滑且柔。翠旗澹偃蹇,云车纷少留。 716 | 箫鼓荡四溟,异香泱漭浮。鲛人献微绡,曾祝沈豪牛。 717 | 百祥奔盛明,古先莫能俦。坡陀金虾蟆,出见盖有由。 718 | 至尊顾之笑,王母不肯收。复归虚无底,化作长黄虬。 719 | 飘飘青琐郎,文彩珊瑚钩。浩歌渌水曲,清绝听者愁。 720 | 许生五台宾,业白出石壁。余亦师粲可,身犹缚禅寂。 721 | 何阶子方便,谬引为匹敌。离索晚相逢,包蒙欣有击。 722 | 诵诗浑游衍,四座皆辟易。应手看捶钩,清心听鸣镝。 723 | 精微穿溟涬,飞动摧霹雳。陶谢不枝梧,风骚共推激。 724 | 紫燕自超诣,翠驳谁剪剔。君意人莫知,人间夜寥阒。 725 | 先帝昔晏驾,兹山朝百灵。崇冈拥象设,沃野开天庭。 726 | 即事壮重险,论功超五丁。坡陀因厚地,却略罗峻屏。 727 | 云阙虚冉冉,风松肃泠泠。石门霜露白,玉殿莓苔青。 728 | 宫女晚知曙,祠官朝见星。空梁簇画戟,阴井敲铜瓶。 729 | 中使日夜继,惟王心不宁。岂徒恤备享,尚谓求无形。 730 | 孝理敦国政,神凝推道经。瑞芝产庙柱,好鸟鸣岩扃。 731 | 高岳前嵂崒,洪河左滢濙。金城蓄峻址,沙苑交回汀。 732 | 永与奥区固,川原纷眇冥。居然赤县立,台榭争岧亭。 733 | 官属果称是,声华真可听。王刘美竹润,裴李春兰馨。 734 | 郑氏才振古,啖侯笔不停。遣辞必中律,利物常发硎。 735 | 绮绣相展转,琳琅愈青荧。侧闻鲁恭化,秉德崔瑗铭。 736 | 太史候凫影,王乔随鹤翎。朝仪限霄汉,容思回林坰。 737 | 轗轲辞下杜,飘飖陵浊泾。诸生旧短褐,旅泛一浮萍。 738 | 荒岁儿女瘦,暮途涕泗零。主人念老马,廨署容秋萤。 739 | 流寓理岂惬,穷愁醉未醒。何当摆俗累,浩荡乘沧溟。 740 | 君不见左辅白沙如白水,缭以周墙百馀里。 741 | 龙媒昔是渥洼生,汗血今称献于此。苑中騋牝三千匹, 742 | 丰草青青寒不死。食之豪健西域无,每岁攻驹冠边鄙。 743 | 王有虎臣司苑门,入门天厩皆云屯。骕骦一骨独当御, 744 | 春秋二时归至尊。至尊内外马盈亿,伏枥在坰空大存。 745 | 逸群绝足信殊杰,倜傥权奇难具论。累累塠阜藏奔突, 746 | 往往坡陀纵超越。角壮翻同麋鹿游,浮深簸荡鼋鼍窟。 747 | 泉出巨鱼长比人,丹砂作尾黄金鳞。岂知异物同精气, 748 | 虽未成龙亦有神。 749 | 邓公马癖人共知,初得花骢大宛种。夙昔传闻思一见, 750 | 牵来左右神皆竦。雄姿逸态何崷崒,顾影骄嘶自矜宠。 751 | 隅目青荧夹镜悬,肉骏碨礌连钱动。朝来久试华轩下, 752 | 未觉千金满高价。赤汗微生白雪毛,银鞍却覆香罗帕。 753 | 卿家旧赐公取之,天厩真龙此其亚。昼洗须腾泾渭深, 754 | 朝趋可刷幽并夜。吾闻良骥老始成,此马数年人更惊。 755 | 岂有四蹄疾于鸟,不与八骏俱先鸣。时俗造次那得致, 756 | 云雾晦冥方降精。近闻下诏喧都邑,肯使骐驎地上行。 757 | 758 | 君不见鞲上鹰,一饱则飞掣。焉能作堂上燕, 759 | 衔泥附炎热。野人旷荡无靦颜,岂可久在王侯间。 760 | 未试囊中餐玉法,明朝且入蓝田山。 761 | 762 | 杜陵有布衣,老大意转拙。许身一何愚,窃比稷与契。 763 | 居然成濩落,白首甘契阔。盖棺事则已,此志常觊豁。 764 | 穷年忧黎元,叹息肠内热。取笑同学翁,浩歌弥激烈。 765 | 非无江海志,萧洒送日月。生逢尧舜君,不忍便永诀。 766 | 当今廊庙具,构厦岂云缺。葵藿倾太阳,物性固莫夺。 767 | 顾惟蝼蚁辈,但自求其穴。胡为慕大鲸,辄拟偃溟渤。 768 | 以兹悟生理,独耻事干谒。兀兀遂至今,忍为尘埃没。 769 | 终愧巢与由,未能易其节。沈饮聊自适,放歌颇愁绝。 770 | 岁暮百草零,疾风高冈裂。天衢阴峥嵘,客子中夜发。 771 | 霜严衣带断,指直不得结。凌晨过骊山,御榻在嵽嵲。 772 | 蚩尤塞寒空,蹴蹋崖谷滑。瑶池气郁律,羽林相摩戛。 773 | 君臣留欢娱,乐动殷樛嶱。赐浴皆长缨,与宴非短褐。 774 | 彤庭所分帛,本自寒女出。鞭挞其夫家,聚敛贡城阙。 775 | 圣人筐篚恩,实欲邦国活。臣如忽至理,君岂弃此物。 776 | 多士盈朝廷,仁者宜战栗。况闻内金盘,尽在卫霍室。 777 | 
中堂舞神仙,烟雾散玉质。暖客貂鼠裘,悲管逐清瑟。 778 | 劝客驼蹄羹,霜橙压香橘。朱门酒肉臭,路有冻死骨。 779 | 荣枯咫尺异,惆怅难再述。北辕就泾渭,官渡又改辙。 780 | 群冰从西下,极目高崒兀。疑是崆峒来,恐触天柱折。 781 | 河梁幸未坼,枝撑声窸窣。行旅相攀援,川广不可越。 782 | 老妻寄异县,十口隔风雪。谁能久不顾,庶往共饥渴。 783 | 入门闻号咷,幼子饥已卒。吾宁舍一哀,里巷亦呜咽。 784 | 所愧为人父,无食致夭折。岂知秋未登,贫窭有仓卒。 785 | 生常免租税,名不隶征伐。抚迹犹酸辛,平人固骚屑。 786 | 默思失业徒,因念远戍卒。忧端齐终南,澒洞不可掇。 787 | 788 | 堂上不合生枫树,怪底江山起烟雾。闻君扫却赤县图, 789 | 乘兴遣画沧洲趣。画师亦无数,好手不可遇。 790 | 对此融心神。知君重毫素。岂但祁岳与郑虔, 791 | 笔迹远过杨契丹。得非悬圃裂,无乃潇湘翻。 792 | 悄然坐我天姥下,耳边已似闻清猿。反思前夜风雨急, 793 | 乃是蒲城鬼神入。元气淋漓障犹湿,真宰上诉天应泣。 794 | 野亭春还杂花远,渔翁暝蹋孤舟立。沧浪水深青溟阔, 795 | 欹岸侧岛秋毫末。不见湘妃鼓瑟时,至今斑竹临江活。 796 | 刘侯天机精,爱画入骨髓。自有两儿郎,挥洒亦莫比。 797 | 大儿聪明到,能添老树巅崖里。小儿心孔开。 798 | 貌得山僧及童子。若耶溪,云门寺。 799 | 吾独胡为在泥滓,青鞋布袜从此始。 800 | 客从南县来,浩荡无与适。旅食白日长,况当朱炎赫。 801 | 高斋坐林杪,信宿游衍阒。清晨陪跻攀,傲睨俯峭壁。 802 | 崇冈相枕带,旷野怀咫尺。始知贤主人,赠此遣愁寂。 803 | 危阶根青冥,曾冰生淅沥。上有无心云,下有欲落石。 804 | 泉声闻复急,动静随所击。鸟呼藏其身,有似惧弹射。 805 | 吏隐道性情,兹焉其窟宅。白水见舅氏,诸翁乃仙伯。 806 | 杖藜长松阴,作尉穷谷僻。为我炊雕胡,逍遥展良觌。 807 | 坐久风颇愁,晚来山更碧。相对十丈蛟,欻翻盘涡坼。 808 | 何得空里雷,殷殷寻地脉。烟氛蔼崷崒,魍魉森惨戚。 809 | 昆仑崆峒颠,回首如不隔。前轩颓反照,巉绝华岳赤。 810 | 兵气涨林峦,川光杂锋镝。知是相公军,铁马云雾积。 811 | 玉觞淡无味,胡羯岂强敌。长歌激屋梁,泪下流衽席。 812 | 人生半哀乐,天地有顺逆。慨彼万国夫,休明备征狄。 813 | 猛将纷填委,庙谋蓄长策。东郊何时开,带甲且来释。 814 | 欲告清宴罢,难拒幽明迫。三叹酒食旁,何由似平昔。 815 | 我经华原来,不复见平陆。北上唯土山,连山走穷谷。 816 | 火云无时出,飞电常在目。自多穷岫雨,行潦相豗蹙。 817 | 蓊匌川气黄,群流会空曲。清晨望高浪,忽谓阴崖踣。 818 | 恐泥窜蛟龙,登危聚麋鹿。枯查卷拔树,礧磈共充塞。 819 | 声吹鬼神下,势阅人代速。不有万穴归,何以尊四渎。 820 | 及观泉源涨,反惧江海覆。漂沙坼岸去,漱壑松柏秃。 821 | 乘陵破山门,回斡裂地轴。交洛赴洪河,及关岂信宿。 822 | 应沈数州没,如听万室哭。秽浊殊未清,风涛怒犹蓄。 823 | 何时通舟车,阴气不黪黩。浮生有荡汩,吾道正羁束。 824 | 人寰难容身,石壁滑侧足。云雷此不已,艰险路更跼。 825 | 普天无川梁,欲济愿水缩。因悲中林士,未脱众鱼腹。 826 | 举头向苍天,安得骑鸿鹄。 827 | 828 | 孟冬十郡良家子,血作陈陶泽中水。野旷天清无战声, 829 | 四万义军同日死。群胡归来血洗箭,仍唱胡歌饮都市。 830 | 都人回面向北啼,日夜更望官军至。 831 | 832 | 我军青坂在东门,天寒饮马太白窟。黄头奚儿日向西, 833 | 数骑弯弓敢驰突。山雪河冰野萧瑟,青是烽烟白人骨。 834 | 焉得附书与我军,忍待明年莫仓卒。 835 | 836 | 少陵野老吞声哭,春日潜行曲江曲。江头宫殿锁千门, 837 | 细柳新蒲为谁绿。忆昔霓旌下南苑,苑中万物生颜色。 838 | 昭阳殿里第一人,同辇随君侍君侧。辇前才人带弓箭, 839 | 白马嚼啮黄金勒。翻身向天仰射云,一箭正坠双飞翼。 840 | 明眸皓齿今何在,血污游魂归不得。清渭东流剑阁深, 841 | 去住彼此无消息。人生有情泪沾臆,江水江花岂终极。 842 | 黄昏胡骑尘满城,欲往城南忘南北。 843 | 844 | 长安城头头白乌,夜飞延秋门上呼。又向人家啄大屋, 845 | 屋底达官走避胡。金鞭断折九马死。骨肉不待同驰驱。 846 | 腰下宝玦青珊瑚,可怜王孙泣路隅。问之不肯道姓名, 847 | 但道困苦乞为奴。已经百日窜荆棘,身上无有完肌肤。 848 | 高帝子孙尽隆准,龙种自与常人殊。豺狼在邑龙在野。 849 | 王孙善保千金躯。不敢长语临交衢,且为王孙立斯须。 850 | 昨夜东风吹血腥,东来橐驼满旧都。朔方健儿好身手, 851 | 昔何勇锐今何愚。窃闻天子已传位,圣德北服南单于。 852 | 花门kO面请雪耻,慎勿出口他人狙。哀哉王孙慎勿疏, 853 | 五陵佳气无时无。 854 | 心在水精域,衣沾春雨时。洞门尽徐步,深院果幽期。 855 | 到扉开复闭,撞钟斋及兹。醍醐长发性,饮食过扶衰。 856 | 把臂有多日,开怀无愧辞。黄鹂度结构,紫鸽下罘罳。 857 | 愚意会所适,花边行自迟。汤休起我病,微笑索题诗。 858 | 859 | 细软青丝履,光明白氎巾。深藏供老宿,取用及吾身。 860 | 自顾转无趣,交情何尚新。道林才不世,惠远德过人。 861 | 雨泻暮檐竹,风吹青井芹。天阴对图画,最觉润龙鳞。 862 | 灯影照无睡,心清闻妙香。夜深殿突兀,风动金锒铛。 863 | 天黑闭春院,地清栖暗芳。玉绳回断绝,铁凤森翱翔。 864 | 梵放时出寺,钟残仍殷床。明朝在沃野,苦见尘沙黄。 865 | 866 | 童儿汲井华,惯捷瓶上手。沾洒不濡地,扫除似无帚。 867 | 明霞烂复阁,霁雾搴高牖。侧塞被径花,飘飖委墀柳。 868 | 艰难世事迫,隐遁佳期后。晤语契深心,那能总箝口。 869 | 奉辞还杖策,暂别终回首。泱泱泥污人,听听国多狗。 870 | 既未免羁绊,时来憩奔走。近公如白雪,执热烦何有。 -------------------------------------------------------------------------------- /Task5-Language Model/models.py: -------------------------------------------------------------------------------- 1 | import torch 2 | import torch.nn as nn 3 | 4 | class LSTM(nn.Module): 5 | def __init__(self, input_size, hidden_size=128, dropout_rate=0.5, layer_num=1): 6 | super(LSTM, self).__init__() 7 | self.hidden_size = hidden_size 8 | self.layer_num = layer_num 9 | if layer_num == 1: 10 | self.lstm = nn.LSTM(input_size, hidden_size, layer_num, batch_first=True) 11 | else: 12 | self.lstm = nn.LSTM(input_size, hidden_size, layer_num, dropout=dropout_rate, batch_first=True) 13 | 14 | self.init_weights() 15 | 16 | def init_weights(self): 17 | for p in self.lstm.parameters(): 18 | if p.dim() > 1: 19 | nn.init.xavier_normal_(p) 20 | else: 21 | p.data.zero_() 22 | 23 | def init_hidden(self, batch_size): 24 | 
weight = next(self.parameters()) 25 | return (weight.new_zeros(self.layer_num, batch_size, self.hidden_size), 26 | weight.new_zeros(self.layer_num, batch_size, self.hidden_size)) 27 | 28 | def forward(self, x, lens, hidden): 29 | ''' 30 | :param x: (batch, seq_len, input_size) 31 | :param lens: (batch, ), in descending order 32 | :param hidden: tuple(h,c), each has shape (num_layer, batch, hidden_size) 33 | :return: output: (batch, seq_len, hidden_size) 34 | tuple(h,c): each has shape (num_layer, batch, hidden_size) 35 | ''' 36 | packed_x = nn.utils.rnn.pack_padded_sequence(x, lens, batch_first=True) 37 | packed_output, (h, c) = self.lstm(packed_x, hidden) 38 | output, _ = nn.utils.rnn.pad_packed_sequence(packed_output, batch_first=True) 39 | return output, (h, c) 40 | 41 | 42 | class LSTM_LM(nn.Module): 43 | def __init__(self, vocab_size, embed_size, hidden_size=128, dropout_rate=0.2, layer_num=1, max_seq_len=128): 44 | super(LSTM_LM, self).__init__() 45 | self.hidden_size = hidden_size 46 | self.layer_num = layer_num 47 | self.embed = nn.Embedding(vocab_size, embed_size) 48 | self.lstm = LSTM(embed_size, hidden_size, dropout_rate, layer_num) 49 | self.project = nn.Linear(hidden_size, vocab_size) 50 | self.dropout = nn.Dropout(dropout_rate) 51 | self.init_weights() 52 | 53 | def init_weights(self): 54 | nn.init.xavier_normal_(self.embed.weight) 55 | nn.init.xavier_normal_(self.project.weight) 56 | 57 | def forward(self, x, lens, hidden): 58 | ''' 59 | :param x: (batch, seq_len), token indices 60 | :param lens: (batch, ), in descending order 61 | :param hidden: tuple(h,c), each has shape (num_layer, batch, hidden_size) 62 | :return: output: (batch, seq_len, vocab_size) 63 | tuple(h,c): each has shape (num_layer, batch, hidden_size) 64 | ''' 65 | embed = self.embed(x) 66 | hidden, (h, c) = self.lstm(self.dropout(embed), lens, hidden) # (batch, seq_len, hidden_size) 67 | out = self.project(self.dropout(hidden)) # (batch, seq_len, vocab_size) 68 | return out, (h, c) 69 | -------------------------------------------------------------------------------- /Task5-Language Model/run.py: -------------------------------------------------------------------------------- 1 | # -*- coding:utf8 -*- 2 | import torch 3 | import torch.nn as nn 4 | import torch.optim as optim 5 | from tqdm import tqdm, trange 6 | from tensorboardX import SummaryWriter 7 | from models import LSTM_LM 8 | from util import load_iters 9 | import math 10 | 11 | torch.manual_seed(1) 12 | device = torch.device('cuda' if torch.cuda.is_available() else 'cpu') 13 | 14 | BATCH_SIZE = 64 15 | HIDDEN_DIM = 512 16 | LAYER_NUM = 1 17 | EPOCHS = 200 18 | DROPOUT_RATE = 0.5 19 | LEARNING_RATE = 0.01 20 | MOMENTUM = 0.9 21 | CLIP = 5 22 | DECAY_RATE = 0.05 # learning rate decay rate 23 | EOS_TOKEN = "[EOS]" 24 | DATA_PATH = 'data' 25 | EMBEDDING_SIZE = 200 26 | TEMPERATURE = 0.8 # Higher temperature means more diversity. 27 | MAX_LEN = 64 28 | 29 | 30 | def train(train_iter, dev_iter, loss_func, optimizer, epochs, clip): 31 | for epoch in trange(epochs): 32 | model.train() 33 | total_loss = 0 34 | total_words = 0 35 | for i, batch in enumerate(tqdm(train_iter)): 36 | text, lens = batch.text 37 | if epoch == 0 and i == 0: 38 | tqdm.write(' '.join([TEXT.vocab.itos[i] for i in text[0]])) 39 | tqdm.write(' '.join([str(i.item()) for i in text[0]])) 40 | inputs = text[:, :-1] 41 | targets = text[:, 1:] 42 | init_hidden = model.lstm.init_hidden(inputs.size(0)) 43 | logits, _ = model(inputs, lens - 1, init_hidden) # [EOS] is included in length.
44 | loss = loss_func(logits.reshape(-1, logits.size(-1)), targets.reshape(-1)) 45 | 46 | model.zero_grad() 47 | loss.backward() 48 | nn.utils.clip_grad_norm_(model.parameters(), clip) 49 | optimizer.step() 50 | total_loss += loss.item() 51 | total_words += lens.sum().item() 52 | tqdm.write("Epoch: %d, Train perplexity: %.3f" % (epoch + 1, math.exp(total_loss / total_words))) 53 | writer.add_scalar('Train_Loss', total_loss, epoch) 54 | eval(dev_iter, True, epoch) 55 | 56 | lr = LEARNING_RATE / (1 + DECAY_RATE * (epoch + 1)) 57 | for param_group in optimizer.param_groups: 58 | param_group['lr'] = lr 59 | 60 | 61 | def eval(data_iter, is_dev=False, epoch=None): 62 | model.eval() 63 | with torch.no_grad(): 64 | total_words = 0 65 | total_loss = 0 66 | for i, batch in enumerate(data_iter): 67 | text, lens = batch.text 68 | inputs = text[:, :-1] 69 | targets = text[:, 1:] 70 | 71 | init_hidden = model.lstm.init_hidden(inputs.size(0)) 72 | logits, _ = model(inputs, lens - 1, init_hidden) # [EOS] is included in length. 73 | loss = loss_func(logits.reshape(-1, logits.size(-1)), targets.reshape(-1)) 74 | 75 | total_loss += loss.item() 76 | total_words += lens.sum().item() 77 | if epoch is not None: 78 | tqdm.write( 79 | "Epoch: %d, %s perplexity %.3f" % ( 80 | epoch + 1, "Dev" if is_dev else "Test", math.exp(total_loss / total_words))) 81 | writer.add_scalar('Dev_Loss', total_loss, epoch) 82 | else: 83 | tqdm.write( 84 | "%s perplexity %.3f" % ("Dev" if is_dev else "Test", math.exp(total_loss / total_words))) 85 | 86 | 87 | def generate(eos_idx, word, temperature=0.8): 88 | model.eval() 89 | with torch.no_grad(): 90 | if word in TEXT.vocab.stoi: 91 | idx = TEXT.vocab.stoi[word] 92 | inputs = torch.tensor([idx]) 93 | else: 94 | print("%s is not in vocabulary, choose by random."
% word) 95 | prob = torch.ones(len(TEXT.vocab.stoi)) 96 | inputs = torch.multinomial(prob, 1) 97 | idx = inputs[0].item() 98 | 99 | inputs = inputs.unsqueeze(1).to(device) # shape [1, 1] 100 | lens = torch.tensor([1]).to(device) 101 | hidden = tuple([h.to(device) for h in model.lstm.init_hidden(1)]) 102 | poetry = [TEXT.vocab.itos[idx]] 103 | 104 | while idx != eos_idx: 105 | logits, hidden = model(inputs, lens, hidden) # logits: (1, 1, vocab_size) 106 | word_weights = logits.squeeze().div(temperature).exp().cpu() 107 | idx = torch.multinomial(word_weights, 1)[0].item() 108 | inputs.fill_(idx) 109 | poetry.append(TEXT.vocab.itos[idx]) 110 | print("".join(poetry[:-1])) 111 | 112 | 113 | if __name__ == "__main__": 114 | train_iter, dev_iter, test_iter, TEXT = load_iters(EOS_TOKEN, BATCH_SIZE, device, DATA_PATH, MAX_LEN) 115 | pad_idx = TEXT.vocab.stoi[TEXT.pad_token] 116 | eos_idx = TEXT.vocab.stoi[EOS_TOKEN] 117 | model = LSTM_LM(len(TEXT.vocab), EMBEDDING_SIZE, HIDDEN_DIM, DROPOUT_RATE, LAYER_NUM).to(device) 118 | 119 | optimizer = optim.SGD(model.parameters(), lr=LEARNING_RATE, momentum=MOMENTUM) 120 | loss_func = torch.nn.CrossEntropyLoss(ignore_index=pad_idx, reduction="sum") 121 | writer = SummaryWriter("logs") 122 | train(train_iter, dev_iter, loss_func, optimizer, EPOCHS, CLIP) 123 | eval(test_iter, is_dev=False) 124 | try: 125 | while True: 126 | word = input("Input the first word or press Ctrl-C to exit: ") 127 | generate(eos_idx, word.strip(), TEMPERATURE) 128 | except: 129 | pass 130 | -------------------------------------------------------------------------------- /Task5-Language Model/util.py: -------------------------------------------------------------------------------- 1 | from torchtext import data 2 | from torchtext.data import BucketIterator 3 | import os 4 | 5 | 6 | def read_data(input_file, max_length): 7 | with open(input_file, encoding="utf8") as f: 8 | poetries = [] 9 | poetry = [] 10 | for line in f: 11 | contends = line.strip() 12 | if len(poetry) + len(contends) <= max_length: 13 | if contends: 14 | poetry.extend(contends) 15 | else: 16 | poetries.append(poetry) 17 | poetry = [] 18 | else: 19 | poetries.append(poetry) 20 | poetry = list(contends) 21 | if poetry: 22 | poetries.append(poetry) 23 | return poetries 24 | 25 | 26 | class PoetryDataset(data.Dataset): 27 | 28 | def __init__(self, text_field, datafile, max_length, **kwargs): 29 | fields = [("text", text_field)] 30 | datas = read_data(datafile, max_length) 31 | examples = [] 32 | for text in datas: 33 | examples.append(data.Example.fromlist([text], fields)) 34 | super(PoetryDataset, self).__init__(examples, fields, **kwargs) 35 | 36 | 37 | def load_iters(eos_token="[EOS]", batch_size=32, device="cpu", data_path='data', max_length=128): 38 | TEXT = data.Field(eos_token=eos_token, batch_first=True, include_lengths=True) 39 | datas = PoetryDataset(TEXT, os.path.join(data_path, "poetryFromTang.txt"), max_length) 40 | train_data, dev_data, test_data = datas.split([0.8, 0.1, 0.1]) 41 | 42 | TEXT.build_vocab(train_data) 43 | 44 | train_iter, dev_iter, test_iter = BucketIterator.splits( 45 | (train_data, dev_data, test_data), 46 | batch_sizes=(batch_size, batch_size, batch_size), 47 | device=device, 48 | sort_key=lambda x: len(x.text), 49 | sort_within_batch=True, 50 | repeat=False, 51 | shuffle=True 52 | ) 53 | return train_iter, dev_iter, test_iter, TEXT 54 | -------------------------------------------------------------------------------- /pics/ESIM.jpg: 
-------------------------------------------------------------------------------- https://raw.githubusercontent.com/jiacheng-ye/code-for-nlp-beginner/3436050d1d527f2bc20c62b5524a1fe782dd9c54/pics/ESIM.jpg --------------------------------------------------------------------------------