├── MyTransformer.ipynb
├── README.md
├── data.py
├── images
│   ├── Positional Encoding.jpg
│   ├── Subsequence Mask.jpg
│   ├── Transformer_layer.png
│   ├── Transformer_structure.jpg
│   ├── Transformer_test.jpg
│   └── data.jpg
├── model.py
├── test.py
└── train.py


/README.md:
--------------------------------------------------------------------------------
 1 | # MyTransformer_pytorch
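 2 | A concise PyTorch implementation of the Transformer model, with detailed comments
 3 | 
 4 | > Compared with the reference code, this version drops the unnecessary return values of each module (such as the attention matrices), aiming for the leanest and clearest implementation possible. It is intended for beginners.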
 5 | 
 6 | - Reference code:
 7 |   1. https://wmathor.com/index.php/archives/1455/
 8 |   2. http://nlp.seas.harvard.edu/annotated-transformer/ (the Harvard NLP group's implementation)
 9 | 
10 | 
11 | - file_list
12 |   - MyTransformer.ipynb
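13 |     - The implementation as a Jupyter notebook; it contains more explanatory text than the .py files
14 |   - images
15 |     - Figures drawn to aid understanding, used in MyTransformer.ipynb
16 |   - data.py
17 |     - Data preprocessing
18 |   - model.py
19 |     - Model definition
20 |   - train.py
21 |     - Trains the model
22 |   - test.py
23 |     - Loads a trained .pth model file and tests it on a translation task
24 |   - .pth
25 |     - My_Transformer.pth  
26 |       - Model trained for 1000 epochs with the original concat implementation; loss is about 3e-6
27 |     - My_Transformer_concat.pth
28 |       - Model trained for 1000 epochs with my modified concat implementation; loss is also about 3e-6
29 |     - MyTransformer_fault.pth
30 |       - Model trained for only 5 epochs, used to verify that the test is meaningful (predictions made with this model are wrong)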
31 |      
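32 | - Download link for the trained model files:
33 |   - Link: https://pan.baidu.com/s/133Ud8f0yHV3kFnawdZGsQA 
34 |   - Extraction code: 2022
35 | - After downloading, extract the files into the project root
36 | 
37 | - My email: crazystone_lei@163.com
38 |   - Feel free to get in touch
39 |   
40 | - All of the above is original work and the reference code is listed; please credit the source when reposting. Best wishes.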
41 | 


--------------------------------------------------------------------------------
/data.py:
--------------------------------------------------------------------------------
 1 | import torch
 2 | import torch.utils.data as Data
 3 | 
 4 | 
 5 | # S: start-of-sequence symbol for the decoder input
 6 | # E: end-of-sequence symbol for the decoder output
 7 | # P: padding symbol; a sentence shorter than the longest one in the batch is filled up with it
 8 | sentence = [
 9 |     # enc_input   dec_input    dec_output
10 |     ['ich mochte ein bier P','S i want a beer .', 'i want a beer . E'],
11 |     ['ich mochte ein cola P','S i want a coke .', 'i want a coke . E'],
12 | ]
13 | 
14 | # Vocabularies; index 0 is reserved for padding
15 | # Source vocabulary (German in this example)
16 | src_vocab = {'P':0, 'ich':1,'mochte':2,'ein':3,'bier':4,'cola':5}
17 | src_vocab_size = len(src_vocab) # 6
18 | # Target vocabulary (English in this example); unlike the source one it also contains the special symbols S and E
19 | tgt_vocab = {'P':0,'i':1,'want':2,'a':3,'beer':4,'coke':5,'S':6,'E':7,'.':8}
20 | # Reverse mapping, index -> word (written as a dict comprehension because the reference code's version is harder to read)
21 | idx2word = {v:k for k,v in tgt_vocab.items()}
22 | tgt_vocab_size = len(tgt_vocab) # 9
23 | 
24 | src_len = 5 # maximum token count of the encoder input (enc_input); in this toy dataset both sentences are hand-padded to this length
25 | tgt_len = 6 # maximum token count of the decoder input/output (dec_input / dec_output)
26 | 
27 | # Build the tensors that are fed to the model
28 | def make_data(sentence):
29 |     enc_inputs, dec_inputs, dec_outputs = [], [], []
30 |     for i in range(len(sentence)):
31 |         enc_input = [src_vocab[word] for word in sentence[i][0].split()]
32 |         dec_input = [tgt_vocab[word] for word in sentence[i][1].split()]
33 |         dec_output = [tgt_vocab[word] for word in sentence[i][2].split()]
34 | 
35 |         enc_inputs.append(enc_input)
36 |         dec_inputs.append(dec_input)
37 |         dec_outputs.append(dec_output)
38 | 
39 |     # torch.LongTensor stores 64-bit integers, which is what nn.Embedding expects as token indices
40 |     return torch.LongTensor(enc_inputs), torch.LongTensor(dec_inputs), torch.LongTensor(dec_outputs)
41 |     # Resulting shapes: enc_inputs (2, 5), dec_inputs (2, 6), dec_outputs (2, 6)
42 | 
43 | 
44 | # Wrap the data in a Dataset
45 | class MyDataSet(Data.Dataset):
46 |     def __init__(self, enc_inputs, dec_inputs, dec_outputs):
47 |         super(MyDataSet, self).__init__()
48 |         self.enc_inputs = enc_inputs
49 |         self.dec_inputs = dec_inputs
50 |         self.dec_outputs = dec_outputs
51 | 
52 |     def __len__(self):
53 |         # enc_inputs.shape is [2, 5] here, so this returns 2
54 |         return self.enc_inputs.shape[0]
55 | 
56 |     # Return one (enc_input, dec_input, dec_output) triple for the given idx
57 |     def __getitem__(self, idx):
58 |         return self.enc_inputs[idx], self.dec_inputs[idx], self.dec_outputs[idx]
59 | 
60 | 
61 | # Build the inputs
62 | enc_inputs, dec_inputs, dec_outputs = make_data(sentence)
63 | 
64 | # Build the DataLoader
65 | loader = Data.DataLoader(dataset=MyDataSet(enc_inputs, dec_inputs, dec_outputs), batch_size=2, shuffle=True)
66 | 
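The index mapping performed by `make_data` in data.py can be checked without PyTorch. The following is a minimal sketch that copies the two vocabularies from the file; the helper names `encode` and `decode` are assumptions introduced here for illustration and are not part of the repository.

```python
# Vocabularies copied from data.py; index 0 is the padding symbol P.
src_vocab = {'P': 0, 'ich': 1, 'mochte': 2, 'ein': 3, 'bier': 4, 'cola': 5}
tgt_vocab = {'P': 0, 'i': 1, 'want': 2, 'a': 3, 'beer': 4, 'coke': 5, 'S': 6, 'E': 7, '.': 8}
idx2word = {idx: word for word, idx in tgt_vocab.items()}


def encode(sentence, vocab):
    """Hypothetical helper: map a whitespace-separated sentence to token indices."""
    return [vocab[word] for word in sentence.split()]


def decode(indices):
    """Hypothetical helper: map target-side indices back to words."""
    return [idx2word[i] for i in indices]


enc_input = encode('ich mochte ein bier P', src_vocab)
dec_input = encode('S i want a beer .', tgt_vocab)
dec_output = encode('i want a beer . E', tgt_vocab)
print(enc_input)   # [1, 2, 3, 4, 0]
print(dec_input)   # [6, 1, 2, 3, 4, 8]
print(dec_output)  # [1, 2, 3, 4, 8, 7]
print(decode(dec_output))
```

Note that `dec_output` is `dec_input` shifted left by one token: at every position the decoder is trained to predict the token that follows the ones it has been given.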


--------------------------------------------------------------------------------
/images/Positional Encoding.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/BoXiaolei/MyTransformer_pytorch/69a0f088822f004e9bc1b563a292becb11265a21/images/Positional Encoding.jpg


--------------------------------------------------------------------------------
/images/Subsequence Mask.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/BoXiaolei/MyTransformer_pytorch/69a0f088822f004e9bc1b563a292becb11265a21/images/Subsequence Mask.jpg


--------------------------------------------------------------------------------
/images/Transformer_layer.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/BoXiaolei/MyTransformer_pytorch/69a0f088822f004e9bc1b563a292becb11265a21/images/Transformer_layer.png


--------------------------------------------------------------------------------
/images/Transformer_structure.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/BoXiaolei/MyTransformer_pytorch/69a0f088822f004e9bc1b563a292becb11265a21/images/Transformer_structure.jpg


--------------------------------------------------------------------------------
/images/Transformer_test.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/BoXiaolei/MyTransformer_pytorch/69a0f088822f004e9bc1b563a292becb11265a21/images/Transformer_test.jpg


--------------------------------------------------------------------------------
/images/data.jpg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/BoXiaolei/MyTransformer_pytorch/69a0f088822f004e9bc1b563a292becb11265a21/images/data.jpg


--------------------------------------------------------------------------------
/model.py:
--------------------------------------------------------------------------------
  1 | from data import *
  2 | import torch
  3 | import numpy as np
  4 | import torch.nn as nn
  5 | 
  6 | 
  7 | # Length of the vector that represents one token
  8 | d_model = 512
  9 | # Number of hidden units in the FFN
 10 | d_ff = 2048
 11 | # Per-head length of the q, k and v vectors; set to 64 as in the paper
 12 | # Paper: "queries and keys of dimension d_k, and values of dimension d_v", so q and k share the length d_k
 13 | d_k = d_v = 64
 14 | # Number of encoder layers and of decoder layers
 15 | n_layers = 6
 16 | # Number of attention heads; paper: "we employ h = 8 parallel attention layers, or heads"
 17 | n_heads = 8
 18 | 
 19 | 
 20 | class PositionalEncoding(nn.Module):
 21 |     def __init__(self, d_model, dropout=0.1, max_len=5000):  # dropout=0.1 follows the paper; max_len is not specified there
 22 |         '''max_len is the assumed maximum number of tokens per sentence (max_seq_len), 5000 here'''
 23 |         super(PositionalEncoding, self).__init__()
 24 |         self.dropout = nn.Dropout(p=dropout)
 25 |         # Positional encoding: start from a max_len x d_model tensor, i.e. 5000 x 512
 26 |         # 5000 is the maximum number of tokens in a sentence and 512 is the vector length per token, so each row encodes one position
 27 |         pe = torch.zeros(max_len, d_model)
 28 |         pos = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1) # pos: [max_len, 1], i.e. [5000, 1]
 29 |         # Compute the argument of sin/cos first: pos is [5000, 1] and the denominator is [256], so the broadcast division gives [5000, 256]
 30 |         div_term = pos / pow(10000.0, torch.arange(0, d_model, 2).float() / d_model)
 31 |         # Then take the sine (even columns) and the cosine (odd columns)
 32 |         pe[:, 0::2] = torch.sin(div_term)
 33 |         pe[:, 1::2] = torch.cos(div_term)
 34 |         # The same pe is shared by every sentence in a batch, so add a leading dimension that broadcasts over the batch
 35 |         pe = pe.unsqueeze(0) # [5000, 512] -> [1, 5000, 512]
 36 |         # Register pe as a buffer: it is saved with the model but is not a parameter and is never updated
 37 |         self.register_buffer('pe', pe)
 38 | 
 39 |     def forward(self, x):
 40 |         ''' x: [batch_size, seq_len, d_model] '''
 41 |         # pe is precomputed for the maximum seq_len (5000); slice out only as many positions as needed
 42 |         x = x + self.pe[:, :x.size(1), :] # broadcast over the first dimension: 1 -> batch_size
 43 |         return self.dropout(x) # return: [batch_size, seq_len, d_model], same shape as the input
 44 | 
 45 | 
 46 | 
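 47 | # Mask out the padding token P (index 0) of the key sequence when computing attention
 48 | # Returns a boolean tensor of shape [batch_size, len_q, len_k]; True marks the positions to be masked
 49 | def get_attn_pad_mask(seq_q, seq_k):
 50 |     # len_q and len_k are the lengths (token counts) of the query and key sequences
 51 |     batch_size, len_q = seq_q.size() # length of the query sequence
 52 |     batch_size, len_k = seq_k.size() # length of the key sequence
 53 |     # seq_k.data.eq(0) returns a boolean tensor shaped like seq_k: True where the element is 0, False elsewhere
 54 |     # Then add a dimension so that it can be expanded below
 55 |     pad_attn_mask = seq_k.data.eq(0).unsqueeze(1) # pad_attn_mask: [batch_size, 1, len_k]
 56 |     # Every query needs its own copy of the key mask, so expand the second dimension to len_q. Only the columns of padded keys are True; the rows of padded queries are left unmasked, which is harmless here because the outputs at padded positions are masked as keys and ignored by the loss
 57 |     res = pad_attn_mask.expand(batch_size, len_q, len_k)
 58 |     return res # return: [batch_size, len_q, len_k]
 59 |     # The result is batch_size matrices of size len_q x len_k; entry (i, j) is True when key token j is padding, i.e. the attention of query token i to key token j is meaningless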
 60 | 
 61 | 
 62 | 
 63 | # Mask for subsequent positions, so that a position cannot see future inputs during prediction
 64 | # Paper: "to prevent positions from attending to subsequent positions"
 65 | def get_attn_subsequence_mask(seq):
 66 |     """seq: [batch_size, tgt_len]"""
 67 |     # batch_size mask matrices of size tgt_len x tgt_len
 68 |     attn_shape = [seq.size(0), seq.size(1), seq.size(1)]
 69 |     # np.triu builds an upper triangular matrix; k is the offset from the main diagonal
 70 |     # k=1 excludes the main diagonal (the triangle starts one step above it)
 71 |     subsequence_mask = np.triu(np.ones(attn_shape), k=1)
 72 |     subsequence_mask = torch.from_numpy(subsequence_mask).byte() # uint8 is enough for 0/1 values and saves memory
 73 |     return subsequence_mask
 74 |     # return: [batch_size, tgt_len, tgt_len]
 75 | 
 76 | 
 77 | class ScaledDotProductionAttention(nn.Module):
 78 |     def __init__(self):
 79 |         super(ScaledDotProductionAttention, self).__init__()
 80 | 
 81 |     def forward(self, Q, K, V, attn_mask):
 82 |         '''
 83 |         Q: [batch_size, n_heads, len_q, d_k]
 84 |         K: [batch_size, n_heads, len_k, d_k]
 85 |         V: [batch_size, n_heads, len_v(=len_k), d_v] Attention is used in two places: self-attention and encoder-decoder (cross) attention. In the latter, K and V are both the encoder output, so K and V always have the same length
 86 |         attn_mask: [batch_size, n_heads, len_q, len_k]
 87 |         '''
 88 |         # 1) Attention scores QK^T / sqrt(d_k)
 89 |         scores = torch.matmul(Q, K.transpose(-1, -2)) / np.sqrt(d_k)  # scores: [batch_size, n_heads, len_q, len_k]
 90 |         # 2) Apply the mask, then softmax
 91 |         # Positions where the mask is True are set to -1e9
 92 |         scores.masked_fill_(attn_mask, -1e9)
 93 |         attn = nn.Softmax(dim=-1)(scores)  # attn: [batch_size, n_heads, len_q, len_k]
 94 |         # 3) Multiply by V to get the weighted sum
 95 |         context = torch.matmul(attn, V)  # context: [batch_size, n_heads, len_q, d_v]
 96 |         '''
 97 |         Each output dimension (column 1..d_v) of context is the attention-weighted sum of that same dimension of V over all tokens,
 98 |         so the dimensions are independent of each other, and the single-column case extends directly to many columns.
 99 | 
100 |         The returned context: [batch_size, n_heads, len_q, d_v] still describes batch_size sentences,
101 |         except that the 512-dimensional token vector is split into 8 parts, one per head; each head handles 512/8 = 64 dimensions of the whole sentence, and the parts are concatenated column-wise afterwards
102 |         '''
103 |         return context # context: [batch_size, n_heads, len_q, d_v]
104 | 
105 | 
106 | 
107 | class MultiHeadAttention(nn.Module):
108 |     def __init__(self):
109 |         super(MultiHeadAttention, self).__init__()
110 |         self.W_Q = nn.Linear(d_model, d_model)
111 |         self.W_K = nn.Linear(d_model, d_model)
112 |         self.W_V = nn.Linear(d_model, d_model)
113 |         self.concat = nn.Linear(d_model, d_model)
114 | 
115 |     def forward(self, input_Q, input_K, input_V, attn_mask):
116 |         '''
117 |         input_Q: [batch_size, len_q, d_model] len_q is the length of the query sentence; e.g. for enc_inputs of shape (2, 5, 512), len_q is 5
118 |         input_K: [batch_size, len_k, d_model]
119 |         input_V: [batch_size, len_v(=len_k), d_model]
120 |         attn_mask: [batch_size, len_q, len_k]
121 |         '''
122 |         residual, batch_size = input_Q, input_Q.size(0)
123 | 
124 |         # 1) Linear projection [batch_size, seq_len, d_model] -> [batch_size, n_heads, seq_len, d_k/d_v]
125 |         Q = self.W_Q(input_Q).view(batch_size, -1, n_heads, d_k).transpose(1, 2) # Q: [batch_size, n_heads, len_q, d_k]
126 |         K = self.W_K(input_K).view(batch_size, -1, n_heads, d_k).transpose(1, 2) # K: [batch_size, n_heads, len_k, d_k]
127 |         V = self.W_V(input_V).view(batch_size, -1, n_heads, d_v).transpose(1, 2) # V: [batch_size, n_heads, len_v(=len_k), d_v]
128 | 
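129 |         # 2) Compute attention
130 |         # Repeat the mask n_heads times, one copy per head
131 |         attn_mask = attn_mask.unsqueeze(1).repeat(1, n_heads, 1, 1)  # attn_mask: [batch_size, n_heads, len_q, len_k]
132 |         context = ScaledDotProductionAttention()(Q, K, V, attn_mask) # context: [batch_size, n_heads, len_q, d_v]
133 | 
134 |         # 3) Concatenate the heads along the feature dimension, then project
135 |         context = torch.cat([context[:,i,:,:] for i in range(context.size(1))], dim=-1)
136 |         output = self.concat(context)  # [batch_size, len_q, d_model]
137 |         return nn.LayerNorm(d_model).cuda()(output + residual)  # output: [batch_size, len_q, d_model]; note that this LayerNorm is re-created on every call, so its weight and bias are never trained
138 | 
139 |         '''        
140 |         Most implementations found online (including the Harvard NLP one) write the concat step as shown below. For a context of shape [batch_size, n_heads, len_q, d_v] it yields the same tensor as the torch.cat above: transpose(1, 2) moves the head axis next to the feature axis before reshape flattens them, so token positions are not reordered. The explicit torch.cat is kept here because it shows the per-head concatenation directly
141 |         context = context.transpose(1, 2).reshape(batch_size, -1, d_model)
142 |         output = self.concat(context)
143 |         '''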
144 | 
145 | 
146 | # Straightforward: this is the "Feed Forward" plus "Add & Norm" block of the model diagram
147 | class PositionwiseFeedForward(nn.Module):
148 |     def __init__(self):
149 |         super(PositionwiseFeedForward, self).__init__()
150 |         # A plain two-layer MLP
151 |         self.fc = nn.Sequential(
152 |             nn.Linear(d_model, d_ff),
153 |             nn.ReLU(),
154 |             nn.Linear(d_ff, d_model)
155 |         )
156 | 
157 |     def forward(self, inputs):
158 |         '''inputs: [batch_size, seq_len, d_model]'''
159 |         residual = inputs
160 |         output = self.fc(inputs)
161 |         return nn.LayerNorm(d_model).cuda()(output + residual) # return: [batch_size, seq_len, d_model], shape unchanged; this LayerNorm is re-created on every call, so its weight and bias are never trained
162 | 
163 | 
164 | class EncoderLayer(nn.Module):
165 |     def __init__(self):
166 |         super(EncoderLayer, self).__init__()
167 |         self.enc_self_attn = MultiHeadAttention()
168 |         self.pos_ffn = PositionwiseFeedForward()
169 | 
170 |     def forward(self, enc_inputs, enc_self_attn_mask):
171 |         '''
172 |         enc_inputs: [batch_size, src_len, d_model]
173 |         enc_self_attn_mask: [batch_size, src_len, src_len]
174 |         '''
175 |         # Q, K and V are all enc_inputs
176 |         enc_outputs = self.enc_self_attn(enc_inputs, enc_inputs, enc_inputs, enc_self_attn_mask) # enc_outputs: [batch_size, src_len, d_model]
177 |         enc_outputs = self.pos_ffn(enc_outputs) # enc_outputs: [batch_size, src_len, d_model]
178 |         return enc_outputs  # enc_outputs: [batch_size, src_len, d_model]
179 | 
180 | 
181 | class Encoder(nn.Module):
182 |     def __init__(self):
183 |         super(Encoder, self).__init__()
184 |         # nn.Embedding maps token indices to vectors; its arguments are the vocabulary size and the embedding length
185 |         self.src_emb = nn.Embedding(src_vocab_size, d_model)
186 |         self.pos_emb = PositionalEncoding(d_model)
187 |         self.layers = nn.ModuleList([EncoderLayer() for _ in range(n_layers)])
188 | 
189 |     def forward(self, enc_inputs):
190 |         '''enc_inputs: [batch_size, src_len]'''
191 |         enc_outputs = self.src_emb(enc_inputs) # [batch_size, src_len] -> [batch_size, src_len, d_model]
192 |         enc_outputs = self.pos_emb(enc_outputs) # enc_outputs: [batch_size, src_len, d_model]
193 |         # The encoder uses self-attention, so both seq_q and seq_k are enc_inputs
194 |         enc_self_attn_mask = get_attn_pad_mask(enc_inputs, enc_inputs)  # enc_self_attn_mask: [batch_size, src_len, src_len]
195 |         for layer in self.layers:
196 |             enc_outputs = layer(enc_outputs, enc_self_attn_mask)
197 |         return enc_outputs  # enc_outputs: [batch_size, src_len, d_model]
198 | 
199 | 
200 | class DecoderLayer(nn.Module):
201 |     def __init__(self):
202 |         super(DecoderLayer, self).__init__()
203 |         self.dec_self_attn = MultiHeadAttention()
204 |         self.dec_enc_attn = MultiHeadAttention()
205 |         self.pos_ffn = PositionwiseFeedForward()
206 | 
207 |     def forward(self, dec_inputs, enc_outputs, dec_self_attn_mask, dec_enc_attn_mask):
208 |         '''
209 |         dec_inputs: [batch_size, tgt_len, d_model]
210 |         enc_outputs: [batch_size, src_len, d_model]
211 |         dec_self_attn_mask: [batch_size, tgt_len, tgt_len]
212 |         dec_enc_attn_mask: [batch_size, tgt_len, src_len] (tgt_len comes from Q, src_len from K)
213 |         '''
214 |         dec_outputs = self.dec_self_attn(dec_inputs, dec_inputs, dec_inputs, dec_self_attn_mask)
215 |         dec_outputs = self.dec_enc_attn(dec_outputs, enc_outputs, enc_outputs, dec_enc_attn_mask)
216 |         dec_outputs = self.pos_ffn(dec_outputs)
217 | 
218 |         return dec_outputs # dec_outputs: [batch_size, tgt_len, d_model]
219 | 
220 | 
221 | 
222 | 
223 | class Decoder(nn.Module):
224 |     def __init__(self):
225 |         super(Decoder, self).__init__()
226 |         self.tgt_emb = nn.Embedding(tgt_vocab_size, d_model)
227 |         self.pos_emb = PositionalEncoding(d_model)
228 |         self.layers = nn.ModuleList([DecoderLayer() for _ in range(n_layers)])
229 | 
230 | 
231 |     def forward(self, dec_inputs, enc_inputs, enc_outputs):
232 |         '''
233 |         The three arguments are not Q, K, V: dec_inputs provides Q, enc_outputs provides K and V, and enc_inputs is only used to build the padding mask
234 |         dec_inputs: [batch_size, tgt_len]
235 |         enc_inputs: [batch_size, src_len]
236 |         enc_outputs: [batch_size, src_len, d_model]
237 |         '''
238 |         dec_outputs = self.tgt_emb(dec_inputs)
239 |         dec_outputs = self.pos_emb(dec_outputs).cuda()
240 |         dec_self_attn_pad_mask = get_attn_pad_mask(dec_inputs, dec_inputs).cuda()
241 |         dec_self_attn_subsequence_mask = get_attn_subsequence_mask(dec_inputs).cuda()
242 |         # Combine the two masks: booleans act as 0/1, so a sum greater than 0 marks a position to mask (True) and a sum of 0 marks a valid position (False)
243 |         dec_self_attn_mask = torch.gt((dec_self_attn_pad_mask +
244 |                                        dec_self_attn_subsequence_mask), 0).cuda()
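245 |         # Encoder-decoder attention: the mask is built from enc_inputs rather than enc_outputs because padding can only be detected from the token indices (0 = P), not from the encoded vectors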
246 |         dec_enc_attn_mask = get_attn_pad_mask(dec_inputs, enc_inputs)
247 | 
248 |         for layer in self.layers:
249 |             dec_outputs = layer(dec_outputs, enc_outputs, dec_self_attn_mask, dec_enc_attn_mask)
250 | 
251 |         return dec_outputs # dec_outputs: [batch_size, tgt_len, d_model]
252 | 
253 | 
254 | class Transformer(nn.Module):
255 |     def __init__(self):
256 |         super(Transformer, self).__init__()
257 |         self.encoder = Encoder().cuda()
258 |         self.decoder = Decoder().cuda()
259 |         self.projection = nn.Linear(d_model, tgt_vocab_size).cuda()
260 | 
261 |     def forward(self, enc_inputs, dec_inputs):
262 |         '''
263 |         enc_inputs: [batch_size, src_len]
264 |         dec_inputs: [batch_size, tgt_len]
265 |         '''
266 |         enc_outputs = self.encoder(enc_inputs)
267 |         dec_outputs = self.decoder(dec_inputs, enc_inputs, enc_outputs)
268 |         dec_logits = self.projection(dec_outputs) # dec_logits: [batch_size, tgt_len, tgt_vocab_size]
269 | 
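270 |         # Flatten the batch: there are batch_size sentences of tgt_len tokens (rows) each; stack them row by row, so the first tgt_len rows hold the per-token logits of the first sentence, the next tgt_len rows those of the second sentence, and so on up to batch_size * tgt_len rows
271 |         return dec_logits.view(-1, dec_logits.size(-1))  #  [batch_size * tgt_len, tgt_vocab_size]
272 |         '''Reason for the reshape: nn.CrossEntropyLoss expects the class dimension to be the second dimension of its input'''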
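The table built in `PositionalEncoding.__init__` can be reproduced with plain Python to see which value ends up in which column. This is a minimal sketch, not the repository's code: the function name `positional_encoding` and the tiny sizes are assumptions chosen for readability.

```python
import math


def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal position encodings."""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):
            # Same argument as in model.py: pos / 10000^(i / d_model), with i even
            angle = pos / (10000.0 ** (i / d_model))
            pe[pos][i] = math.sin(angle)          # even column
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)  # odd column, same angle
    return pe


table = positional_encoding(max_len=4, d_model=6)
for row in table:
    print([round(v, 3) for v in row])
```

Row 0 alternates 0 and 1 (sin 0 and cos 0); later rows change quickly in the first columns and slowly in the last ones.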
273 | 
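The decoder self-attention mask is the element-wise OR of the padding mask and the subsequent-position mask. The sketch below emulates `get_attn_pad_mask`, `get_attn_subsequence_mask` and their combination with nested lists for a single sentence (no batch dimension); the function names and the example sentence are assumptions made for illustration.

```python
def pad_mask(seq_q, seq_k, pad_idx=0):
    """len_q x len_k mask: True where the key token is padding."""
    key_is_pad = [tok == pad_idx for tok in seq_k]
    return [list(key_is_pad) for _ in seq_q]


def subsequent_mask(length):
    """length x length mask: True strictly above the main diagonal."""
    return [[j > i for j in range(length)] for i in range(length)]


def combine(mask_a, mask_b):
    """Element-wise OR, the list equivalent of torch.gt(a + b, 0)."""
    return [[a or b for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(mask_a, mask_b)]


dec_input = [6, 1, 2, 0]  # hypothetical target sentence "S i want P"
mask = combine(pad_mask(dec_input, dec_input), subsequent_mask(len(dec_input)))
for row in mask:
    print([int(v) for v in row])
# [0, 1, 1, 1]
# [0, 0, 1, 1]
# [0, 0, 0, 1]
# [0, 0, 0, 1]
```

Row i lists what query position i may not attend to: every later position, plus the padded last column.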
274 | 
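For a context tensor of shape [batch_size, n_heads, len_q, d_v], concatenating the per-head slices with `torch.cat` and the more common `transpose(1, 2)` followed by `reshape` place every element at the same output index. The sketch below emulates both on nested lists for one sentence, so the batch axis is dropped and the head axis comes first; the function names and toy values are assumptions made for illustration.

```python
def concat_by_cat(context):
    """Mimic torch.cat([context[h] for h in heads], dim=-1) for one sentence.

    context is indexed as context[head][position][feature].
    """
    n_heads, length = len(context), len(context[0])
    return [[x for h in range(n_heads) for x in context[h][pos]]
            for pos in range(length)]


def concat_by_transpose(context):
    """Mimic context.transpose(0, 1).reshape(length, -1) for one sentence."""
    n_heads, length = len(context), len(context[0])
    # transpose: [head][position][feature] -> [position][head][feature]
    transposed = [[context[h][pos] for h in range(n_heads)] for pos in range(length)]
    # reshape: flatten the last two axes in row-major order
    return [[x for head_vec in row for x in head_vec] for row in transposed]


# Toy context: 2 heads, 3 positions, 2 features per head.
# Each value encodes its origin as 100 * head + 10 * position + feature.
context = [[[100 * h + 10 * p + f for f in range(2)] for p in range(3)] for h in range(2)]
a = concat_by_cat(context)
b = concat_by_transpose(context)
print(a[0])    # [0, 1, 100, 101]
print(a == b)  # True
```

A plain `reshape` without the preceding transpose would mix positions, which is why the transpose is required in the one-line form.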


--------------------------------------------------------------------------------
/test.py:
--------------------------------------------------------------------------------
 1 | import torch
 2 | 
 3 | from model import *
 4 | 
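 5 | # The paper uses beam search with a beam size of 4; for simplicity this file uses greedy decoding instead: no candidates are kept and the most probable token is chosen at every step
 6 | # Without greedy_decoder the model would run a single forward pass and predict only ['i']; it is not autoregressive by itself. So the trained encoder and decoder are driven by hand: each decoder output is appended to the next decoder input until the stop token ('.' here) is predicted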
 7 | def greedy_decoder(model, enc_input, start_symbol):
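 8 |     """enc_input: [1, seq_len], a single sentence"""
 9 |     enc_outputs = model.encoder(enc_input) # enc_outputs: [1, seq_len, 512]
10 |     # Create an empty tensor with 1 row and 0 columns and the same dtype as enc_input, to be filled step by step
11 |     dec_input = torch.zeros(1, 0).type_as(enc_input.data) # .data keeps this outside the autograd graph
12 |     next_symbol = start_symbol
13 |     flag = True
14 |     while flag and dec_input.size(1) < tgt_len:  # tgt_len caps the length so that an undertrained model cannot loop forever
15 |         # dec_input.detach() is dec_input without gradient tracking (it shares storage, it is not a copy)
16 |         # torch.tensor([[next_symbol]]) is a (1, 1) tensor holding only next_symbol
17 |         # -1 concatenates along the last dimension
18 |         # Together: append next_symbol to dec_input to form the decoder input of the next step
19 |         dec_input = torch.cat([dec_input.detach(), torch.tensor([[next_symbol]], dtype=enc_input.dtype).cuda()], -1) # dec_input: [1, number of tokens so far]
20 |         dec_outputs = model.decoder(dec_input, enc_input, enc_outputs) # dec_outputs: [1, tgt_len, d_model]
21 |         projected = model.projection(dec_outputs) # projected: [1, current tgt_len, tgt_vocab_size]
22 |         # max returns a (values, indices) tuple, so [1] takes the indices; an index is a class, i.e. the predicted token
23 |         # keepdim=False drops the reduced dimension
24 |         prob = projected.squeeze(0).max(dim=-1, keepdim=False)[1] # prob: [current tgt_len]
25 |         # prob is a 1-D tensor with the indices predicted at every position so far; the last one is the newly generated next token
26 |         # Attention only looks at earlier tokens, so tokens generated later do not change earlier predictions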
27 |         next_symbol = prob.data[-1]
28 |         if next_symbol == tgt_vocab['.']:
29 |             flag = False
30 |         print(next_symbol)
31 |     return dec_input  # dec_input: [1,tgt_len]
32 | 
33 | 
34 | # Test
35 | model = torch.load('MyTransformer_temp.pth')  # on PyTorch >= 2.6, loading a fully pickled model needs torch.load(..., weights_only=False)
36 | model.eval()
37 | with torch.no_grad():
38 |     # Take one batch from the loader by hand
39 |     enc_inputs, _, _ = next(iter(loader))
40 |     enc_inputs = enc_inputs.cuda()
41 |     for i in range(len(enc_inputs)):
42 |         greedy_dec_input = greedy_decoder(model, enc_inputs[i].view(1, -1), start_symbol=tgt_vocab['S'])
43 |         predict  = model(enc_inputs[i].view(1, -1), greedy_dec_input) # predict: [batch_size * tgt_len, tgt_vocab_size]
44 |         predict = predict.data.max(dim=-1, keepdim=False)[1]
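45 |         '''greedy_dec_input was built token by token with the greedy strategy, which picks the most likely token at each step without considering the sequence as a whole, so it is not guaranteed to be the best output.
46 |         The complete greedy sequence is therefore passed through the model once more together with the source sentence, so that every position is predicted in a single pass with the full decoder input as context
47 |         '''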
48 |         print(enc_inputs[i], '->', [idx2word[n.item()] for n in predict])
49 | 
50 | 
51 | 
52 | 
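The control flow of `greedy_decoder` can be tried out without a GPU or a trained checkpoint. The following is a minimal sketch of the same loop with a toy scoring function standing in for the model; `greedy_decode`, `toy_scores`, `TARGET` and the `max_len` cap are assumptions introduced here, not part of the repository.

```python
def greedy_decode(next_token_scores, start_symbol, stop_symbol, max_len):
    """Generic greedy loop: repeatedly append the highest-scoring next token.

    next_token_scores(prefix) is a hypothetical callable that returns one
    score per vocabulary entry for the token that follows `prefix`.
    """
    prefix = [start_symbol]
    while len(prefix) < max_len:
        scores = next_token_scores(prefix)
        next_symbol = max(range(len(scores)), key=scores.__getitem__)  # argmax
        if next_symbol == stop_symbol:
            break  # like test.py, the stop token itself is not appended
        prefix.append(next_symbol)
    return prefix


# Toy stand-in for the trained model: it always continues the fixed sentence
# "S i want a beer ." using the indices of tgt_vocab in data.py.
TARGET = [6, 1, 2, 3, 4, 8]


def toy_scores(prefix):
    scores = [0.0] * 9
    scores[TARGET[len(prefix)]] = 1.0
    return scores


print(greedy_decode(toy_scores, start_symbol=6, stop_symbol=8, max_len=6))
# [6, 1, 2, 3, 4]
```

The result corresponds to "S i want a beer"; feeding it to the model once more, as test.py does, yields one prediction per position, i.e. "i want a beer .".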


--------------------------------------------------------------------------------
/train.py:
--------------------------------------------------------------------------------
 1 | from torch import optim
 2 | 
 3 | from model import *
 4 | 
 5 | 
 6 | model = Transformer().cuda()
 7 | model.train()
 8 | # Loss function; class 0 is ignored because it is padding and carries no meaning
 9 | criterion = nn.CrossEntropyLoss(ignore_index=0)
10 | optimizer = optim.SGD(model.parameters(), lr=1e-3, momentum=0.99)
11 | 
12 | # Training loop
13 | for epoch in range(1000):
14 |     for enc_inputs, dec_inputs, dec_outputs in loader:
15 |         '''
16 |         enc_inputs: [batch_size, src_len] [2,5]
17 |         dec_inputs: [batch_size, tgt_len] [2,6]
18 |         dec_outputs: [batch_size, tgt_len] [2,6]
19 |         '''
20 |         enc_inputs, dec_inputs, dec_outputs = enc_inputs.cuda(), dec_inputs.cuda(), dec_outputs.cuda()
21 |         outputs = model(enc_inputs, dec_inputs) # outputs: [batch_size * tgt_len, tgt_vocab_size]
22 |         # outputs: [batch_size * tgt_len, tgt_vocab_size], dec_outputs: [batch_size, tgt_len]
23 |         loss = criterion(outputs, dec_outputs.view(-1))  # flatten dec_outputs into a 1-D tensor
24 | 
25 |         # Update the weights
26 |         optimizer.zero_grad()
27 |         loss.backward()
28 |         optimizer.step()
29 | 
30 |         print(f'Epoch [{epoch + 1}/1000], Loss: {loss.item()}')
31 | 
32 | torch.save(model, 'MyTransformer_temp.pth')
33 | 
34 | 
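The effect of `ignore_index=0` in the loss can be illustrated with a small hand-written cross-entropy. This is a sketch of the mean-reduced loss over flattened logits, written in plain Python under the assumption that at least one target is not padding; the function name and the toy numbers are invented for the example.

```python
import math


def cross_entropy(logits, targets, ignore_index=0):
    """Mean negative log-likelihood over the targets that are not ignore_index.

    logits: one list of per-class scores per token; targets: one class per token.
    Assumes at least one target differs from ignore_index.
    """
    total, count = 0.0, 0
    for row, target in zip(logits, targets):
        if target == ignore_index:
            continue  # padding contributes nothing to the loss
        m = max(row)  # subtract the maximum for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_sum_exp - row[target]
        count += 1
    return total / count


logits = [[0.0, 2.0, 0.0], [0.0, 0.0, 2.0], [5.0, 0.0, 0.0]]
targets = [1, 2, 0]  # the last token is padding
print(cross_entropy(logits, targets))
```

Dropping the padded row from both lists gives the same value, which is the point of ignoring index 0: padded positions neither add to the sum nor count towards the mean.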


--------------------------------------------------------------------------------