├── README.md ├── 机器学习面试常见问题100道.pdf └── 李沐DeepLearning ├── GRU门控循环单元.ipynb ├── LSTM长短期记忆网络.ipynb ├── README.md ├── RNN循环神经网络.ipynb ├── RNN循环神经网络的简洁实现.ipynb ├── seq2seq序列到序列.ipynb ├── 图片 ├── 文本预处理 │ ├── result1.png │ └── result2.png └── 语言模型 │ ├── fig1.png │ ├── fig2.png │ ├── fig3.png │ ├── fig4.png │ └── fig5.png ├── 数据处理.ipynb ├── 文本预处理 ├── README.md └── 文本预处理.ipynb ├── 时序模型.ipynb ├── 机器翻译数据集.ipynb ├── 线性回归.ipynb ├── 自动求导.ipynb └── 语言模型 ├── README.md └── 语言模型.ipynb /README.md: -------------------------------------------------------------------------------- 1 | # Machine_Learning 2 | 3 | This repository contains my machine learning materials 4 | -------------------------------------------------------------------------------- /机器学习面试常见问题100道.pdf: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pod2c/Machine_Learning/355a729d156888d5fc8b1653b3e181b638e859e8/机器学习面试常见问题100道.pdf -------------------------------------------------------------------------------- /李沐DeepLearning/README.md: -------------------------------------------------------------------------------- 1 | ## 本文件夹包含了B站李沐老师的深度学习部分课程的代码(主要是NLP部分) 2 | -------------------------------------------------------------------------------- /李沐DeepLearning/RNN循环神经网络的简洁实现.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "id": "8b907494", 7 | "metadata": {}, 8 | "outputs": [], 9 | "source": [ 10 | "import torch\n", 11 | "from torch import nn\n", 12 | "from torch.nn import functional as F\n", 13 | "from d2l import torch as d2l\n", 14 | "\n", 15 | "%matplotlib inline" 16 | ] 17 | }, 18 | { 19 | "cell_type": "code", 20 | "execution_count": 2, 21 | "id": "53e41841", 22 | "metadata": {}, 23 | "outputs": [], 24 | "source": [ 25 | "batch_size, num_steps = 32, 35\n", 26 | "train_iter, vocab = d2l.load_data_time_machine(batch_size, num_steps)" 27 | ] 28 | }, 29 | { 30 | "cell_type": "markdown", 31 | "id": "58b4c441", 32 | "metadata": {}, 33 | "source": [ 34 | "首先需要构造一个具有256个隐藏单元的单隐藏层" 35 | ] 36 | }, 37 | { 38 | "cell_type": "code", 39 | "execution_count": 3, 40 | "id": "7c7cc596", 41 | "metadata": {}, 42 | "outputs": [], 43 | "source": [ 44 | "num_hiddens = 256\n", 45 | "rnn_layer = nn.RNN(len(vocab), num_hiddens)" 46 | ] 47 | }, 48 | { 49 | "cell_type": "markdown", 50 | "id": "76b89ea0", 51 | "metadata": {}, 52 | "source": [ 53 | "使用张量来初始化隐状态,隐状态的形状为(隐藏层数,批量大小,隐藏单元数)" 54 | ] 55 | }, 56 | { 57 | "cell_type": "code", 58 | "execution_count": 4, 59 | "id": "0b1b6f52", 60 | "metadata": {}, 61 | "outputs": [ 62 | { 63 | "data": { 64 | "text/plain": [ 65 | "torch.Size([1, 32, 256])" 66 | ] 67 | }, 68 | "execution_count": 4, 69 | "metadata": {}, 70 | "output_type": "execute_result" 71 | } 72 | ], 73 | "source": [ 74 | "state = torch.zeros(1, batch_size, num_hiddens)\n", 75 | "state.shape" 76 | ] 77 | }, 78 | { 79 | "cell_type": "markdown", 80 | "id": "46a9a508", 81 | "metadata": {}, 82 | "source": [ 83 | "现在,通过一个输入和一个隐状态,就能算出往后的隐状态,以及使用隐状态来计算输出。" 84 | ] 85 | }, 86 | { 87 | "cell_type": "code", 88 | "execution_count": 6, 89 | "id": "9cef140a", 90 | "metadata": {}, 91 | "outputs": [ 92 | { 93 | "data": { 94 | "text/plain": [ 95 | "(torch.Size([35, 32, 256]), torch.Size([1, 32, 256]))" 96 | ] 97 | }, 98 | "execution_count": 6, 99 | "metadata": {}, 100 | "output_type": "execute_result" 101 | } 102 | ], 103 | "source": [ 104 | "X = torch.rand(size=(num_steps, batch_size, 
len(vocab)))\n",
"# Y_hiddens为单个隐藏层的输出\n",
"Y_hiddens, new_state = rnn_layer(X, state)\n",
"Y_hiddens.shape, new_state.shape"
] },
{ "cell_type": "markdown", "id": "8947c76a", "metadata": {}, "source": [ "接着为完整的RNN模型构造一个类" ] },
{ "cell_type": "code", "execution_count": 7, "id": "d983a7fb", "metadata": {}, "outputs": [], "source": [
"class RNNModel(nn.Module):\n",
"    def __init__(self, rnn_layer, vocab_size, **kwargs):\n",
"        super(RNNModel, self).__init__(**kwargs)\n",
"        self.rnn = rnn_layer\n",
"        self.vocab_size = vocab_size\n",
"        self.num_hiddens = self.rnn.hidden_size\n",
"        if not self.rnn.bidirectional:\n",
"            self.num_direction = 1\n",
"            self.linear = nn.Linear(self.num_hiddens, self.vocab_size)\n",
"        else:\n",
"            self.num_direction = 2\n",
"            self.linear = nn.Linear(self.num_hiddens*2, self.vocab_size)\n",
"\n",
"    def forward(self, inputs, state):\n",
"        X = F.one_hot(inputs.T.long(), self.vocab_size).type(torch.float32)\n",
"        # Y为单个隐藏层的输出\n",
"        Y, state = self.rnn(X, state)\n",
"        # 先将Y展平成(时间步长*批量大小, 隐藏单元数),再由全连接层映射为(时间步长*批量大小, 词表大小)\n",
"        outputs = self.linear(Y.reshape(-1, Y.shape[-1]))\n",
"        return outputs, state\n",
"\n",
"    def begin_state(self, device, batch_size=1):\n",
"        if not isinstance(self.rnn, nn.LSTM):\n",
"            # nn.GRU以张量为隐状态\n",
"            return torch.zeros((self.num_direction*self.rnn.num_layers, batch_size, self.num_hiddens), device=device)\n",
"        else:\n",
"            # nn.LSTM以元组为隐状态\n",
"            return (torch.zeros((self.num_direction*self.rnn.num_layers, batch_size, self.num_hiddens), device=device),\n",
"                    torch.zeros((self.num_direction*self.rnn.num_layers, batch_size, self.num_hiddens), device=device))"
] },
{ "cell_type": "markdown", "id": "0c552f24", "metadata": {}, "source": [ "先使用具有随机权重的模型进行预测" ] },
{ "cell_type": "code", "execution_count": 8, "id": "53c409cd", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "'time travelleruttttttttt'" ] }, "execution_count": 8, "metadata": {}, "output_type": "execute_result" } ], "source": [
"device = d2l.try_gpu()\n",
"net = RNNModel(rnn_layer, vocab_size=len(vocab))\n",
"net = net.to(device)\n",
"d2l.predict_ch8('time traveller', 10, net, vocab, device)"
] },
{ "cell_type": "markdown", "id": "2dd71cec", "metadata": {}, "source": [ "接着,调用训练函数训练模型,训练结束后再做一次预测" ] },
{ "cell_type": "code", "execution_count": 9, "id": "ecc58cd7", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [
"perplexity 1.3, 289806.6 tokens/sec on cuda:0\n",
"time traveller hald was an youmpand the time travellerit would b\n",
"traveller fof no war fot mathematicians have it isspoken of\n"
] },
{ "data": { "image/svg+xml": [ "(matplotlib SVG 矢量图数据已省略:训练困惑度 perplexity 随 epoch 下降的曲线,Matplotlib v3.5.1 生成)" ], "text/plain": [
"(训练曲线图,SVG 数据已省略)
" 1076 | ] 1077 | }, 1078 | "metadata": {}, 1079 | "output_type": "display_data" 1080 | } 1081 | ], 1082 | "source": [ 1083 | "num_epochs, lr = 500, 1\n", 1084 | "d2l.train_ch8(net, train_iter, vocab, lr, num_epochs, device)" 1085 | ] 1086 | }, 1087 | { 1088 | "cell_type": "code", 1089 | "execution_count": null, 1090 | "id": "e2880e48", 1091 | "metadata": {}, 1092 | "outputs": [], 1093 | "source": [] 1094 | } 1095 | ], 1096 | "metadata": { 1097 | "kernelspec": { 1098 | "display_name": "Python [conda env:torch] *", 1099 | "language": "python", 1100 | "name": "conda-env-torch-py" 1101 | }, 1102 | "language_info": { 1103 | "codemirror_mode": { 1104 | "name": "ipython", 1105 | "version": 3 1106 | }, 1107 | "file_extension": ".py", 1108 | "mimetype": "text/x-python", 1109 | "name": "python", 1110 | "nbconvert_exporter": "python", 1111 | "pygments_lexer": "ipython3", 1112 | "version": "3.8.16" 1113 | } 1114 | }, 1115 | "nbformat": 4, 1116 | "nbformat_minor": 5 1117 | } 1118 | -------------------------------------------------------------------------------- /李沐DeepLearning/seq2seq序列到序列.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "id": "59f21fee", 7 | "metadata": {}, 8 | "outputs": [], 9 | "source": [ 10 | "import collections\n", 11 | "import math\n", 12 | "import torch\n", 13 | "from torch import nn\n", 14 | "from d2l import torch as d2l" 15 | ] 16 | }, 17 | { 18 | "cell_type": "markdown", 19 | "id": "328d5ed0", 20 | "metadata": {}, 21 | "source": [ 22 | "首先,先定义编码器" 23 | ] 24 | }, 25 | { 26 | "cell_type": "code", 27 | "execution_count": 2, 28 | "id": "101b4534", 29 | "metadata": {}, 30 | "outputs": [], 31 | "source": [ 32 | "class seq2seqEncoder(d2l.Encoder): # 继承自d2l.Encoder这个父类\n", 33 | " def __init__(self, vocab_size, embed_size, num_hiddens, num_layers, dropout=0, **kwargs):\n", 34 | " super(seq2seqEncoder, self).__init__(**kwargs) # 调用父类,使得seq2seq编码器可以继承父类的属性和方法\n", 35 | " self.embedding = nn.Embedding(vocab_size, embed_size)\n", 36 | " self.rnn = nn.GRU(embed_size, num_hiddens, num_layers, dropout=dropout)\n", 37 | " \n", 38 | " def forward(self, X, *args):\n", 39 | " # X的形状为(batch_size, num_steps, embed_size)\n", 40 | " X = self.embedding(X)\n", 41 | " X = X.permute(1, 0, 2) # 改变输入的形状,时间步长放到前面\n", 42 | " output, state = self.rnn(X)\n", 43 | " # output的形状:(num_steps, batch_size, num_hidens),state的形状:(num_layers, batch_size, num_hiddens)\n", 44 | " return output, state" 45 | ] 46 | }, 47 | { 48 | "cell_type": "markdown", 49 | "id": "bb053509", 50 | "metadata": {}, 51 | "source": [ 52 | "实例化编码器测试输出" 53 | ] 54 | }, 55 | { 56 | "cell_type": "code", 57 | "execution_count": 3, 58 | "id": "95dd9d26", 59 | "metadata": {}, 60 | "outputs": [ 61 | { 62 | "data": { 63 | "text/plain": [ 64 | "torch.Size([7, 4, 16])" 65 | ] 66 | }, 67 | "execution_count": 3, 68 | "metadata": {}, 69 | "output_type": "execute_result" 70 | } 71 | ], 72 | "source": [ 73 | "encoder = seq2seqEncoder(vocab_size=10, embed_size=8, num_hiddens=16,\n", 74 | " num_layers=2)\n", 75 | "encoder.eval()\n", 76 | "X = torch.zeros((4, 7), dtype=torch.long)\n", 77 | "output, state = encoder(X)\n", 78 | "output.shape" 79 | ] 80 | }, 81 | { 82 | "cell_type": "code", 83 | "execution_count": 4, 84 | "id": "633d4cc6", 85 | "metadata": {}, 86 | "outputs": [ 87 | { 88 | "data": { 89 | "text/plain": [ 90 | "torch.Size([2, 4, 16])" 91 | ] 92 | }, 93 | "execution_count": 4, 94 | "metadata": {}, 95 | "output_type": 
"execute_result" 96 | } 97 | ], 98 | "source": [ 99 | "state.shape" 100 | ] 101 | }, 102 | { 103 | "cell_type": "markdown", 104 | "id": "73fd4a53", 105 | "metadata": {}, 106 | "source": [ 107 | "接着定义解码器" 108 | ] 109 | }, 110 | { 111 | "cell_type": "code", 112 | "execution_count": 5, 113 | "id": "a1aa038d", 114 | "metadata": {}, 115 | "outputs": [], 116 | "source": [ 117 | "class seq2seqDecoder(d2l.Decoder):\n", 118 | " def __init__(self, vocab_size, embed_size, num_hiddens, num_layers, dropout=0, **kwargs):\n", 119 | " super(seq2seqDecoder, self).__init__(**kwargs)\n", 120 | " self.embedding = nn.Embedding(vocab_size, embed_size)\n", 121 | " # 输入为embed_size和num_hiddens的和是因为需要将输入X和上下文进行合并\n", 122 | " self.rnn = nn.GRU(embed_size+num_hiddens, num_hiddens, num_layers, dropout=dropout)\n", 123 | " self.dense = nn.Linear(num_hiddens, vocab_size)\n", 124 | " \n", 125 | " def init_state(self, enc_outputs, *args):\n", 126 | " # 初始化隐状态,取编码器输出的隐状态作为解码器的初始隐状态\n", 127 | " return enc_outputs[1]\n", 128 | " \n", 129 | " def forward(self, X, state):\n", 130 | " # 将X的形状变为(num_steps, batch_size, embed_size)\n", 131 | " X = self.embedding(X).permute(1, 0, 2)\n", 132 | " # 上下文变量取自编码器最后一个时刻输出隐状态的最后一层,并且使其和输入X具有相同的时间步长\n", 133 | " content = state[-1].repeat(X.shape[0], 1, 1)\n", 134 | " # 将输入X和上下文合并\n", 135 | " X_content = torch.cat((X, content), 2)\n", 136 | " output, state = self.rnn(X_content, state)\n", 137 | " output = self.dense(output).permute(1, 0, 2)\n", 138 | " # 输出output形状为(batch_size, num_steps, vocab_size),隐状态state的形状为(num_layers, batch_size, num_hiddens)\n", 139 | " return output, state" 140 | ] 141 | }, 142 | { 143 | "cell_type": "markdown", 144 | "id": "b64ed077", 145 | "metadata": {}, 146 | "source": [ 147 | "实例化解码器测试输出" 148 | ] 149 | }, 150 | { 151 | "cell_type": "code", 152 | "execution_count": 7, 153 | "id": "faf7ef8d", 154 | "metadata": {}, 155 | "outputs": [ 156 | { 157 | "data": { 158 | "text/plain": [ 159 | "(torch.Size([4, 7, 10]), torch.Size([2, 4, 16]))" 160 | ] 161 | }, 162 | "execution_count": 7, 163 | "metadata": {}, 164 | "output_type": "execute_result" 165 | } 166 | ], 167 | "source": [ 168 | "decoder = seq2seqDecoder(vocab_size=10, embed_size=8, num_hiddens=16,\n", 169 | " num_layers=2)\n", 170 | "decoder.eval()\n", 171 | "state = decoder.init_state(encoder(X))\n", 172 | "output, state = decoder(X, state)\n", 173 | "output.shape, state.shape" 174 | ] 175 | }, 176 | { 177 | "cell_type": "markdown", 178 | "id": "5e8833f1", 179 | "metadata": {}, 180 | "source": [ 181 | "零值化屏蔽不相关的项" 182 | ] 183 | }, 184 | { 185 | "cell_type": "code", 186 | "execution_count": 8, 187 | "id": "db020d33", 188 | "metadata": {}, 189 | "outputs": [], 190 | "source": [ 191 | "def sequence_mask(X, valid_len, value=0):\n", 192 | " # 时间步长设为最大序列长度\n", 193 | " maxlen = X.size(1)\n", 194 | " # 判断有效长度生成掩码\n", 195 | " mask = torch.arange((maxlen), dtype=torch.float32, device=X.device)[None, :] < valid_len[:, None]\n", 196 | " # 对掩码取反屏蔽对应的项\n", 197 | " X[~mask] = value\n", 198 | " return X" 199 | ] 200 | }, 201 | { 202 | "cell_type": "code", 203 | "execution_count": 9, 204 | "id": "7a69c960", 205 | "metadata": {}, 206 | "outputs": [], 207 | "source": [ 208 | "class MaskedSoftmaxCELoss(nn.CrossEntropyLoss):\n", 209 | " def forward(self, pred, label, valid_len):\n", 210 | " # 按照标签的形状设置一组单位向量作为权重\n", 211 | " weights = torch.ones_like(label)\n", 212 | " # 将这组权重零值化屏蔽不相关的项\n", 213 | " weights = sequence_mask(weights, valid_len)\n", 214 | " self.reduction = 'none'\n", 215 | " # 计算原始的交叉熵损失函数\n", 216 | " unweighted_loss = 
super(MaskedSoftmaxCELoss, self).forward(pred.permute(0, 2, 1), label)\n",
"        # 将损失函数与权重相乘以计算最终的有效损失函数\n",
"        weight_loss = (unweighted_loss * weights).mean(dim=1)\n",
"        return weight_loss"
] },
{ "cell_type": "markdown", "id": "e44e510b", "metadata": {}, "source": [ "测试输出" ] },
{ "cell_type": "markdown", "id": "ece57673", "metadata": {}, "source": [ "训练" ] },
{ "cell_type": "code", "execution_count": 16, "id": "8b8d9db0", "metadata": {}, "outputs": [], "source": [
"def train(net, data_iter, lr, num_epochs, tgt_vocab, device):\n",
"    # 使用Xavier均匀分布初始化权重\n",
"    def xavier_init_weights(m):\n",
"        if type(m) == nn.Linear:\n",
"            nn.init.xavier_uniform_(m.weight)\n",
"        if type(m) == nn.GRU:\n",
"            for param in m._flat_weights_names:\n",
"                if \"weight\" in param:\n",
"                    nn.init.xavier_uniform_(m._parameters[param])\n",
"\n",
"    net.apply(xavier_init_weights)\n",
"    net.to(device)\n",
"    optimizer = torch.optim.Adam(net.parameters(), lr=lr)\n",
"    Loss = MaskedSoftmaxCELoss()\n",
"    net.train()\n",
"    animator = d2l.Animator(xlabel='epoch', ylabel='loss', xlim=[10, num_epochs])\n",
"\n",
"    for epoch in range(num_epochs):\n",
"        timer = d2l.Timer()\n",
"        metrics = d2l.Accumulator(2)\n",
"\n",
"        for batch in data_iter:\n",
"            optimizer.zero_grad()\n",
"            X, X_valid_len, Y, Y_valid_len = [x.to(device) for x in batch]\n",
"            # 提取特定的序列起始词元<bos>\n",
"            bos = torch.tensor([tgt_vocab['<bos>']] * Y.shape[0], device=device).reshape(-1,1)\n",
"            # 强制教学,将序列起始词元与原始的输出序列(将序列结束词元剔除)合并作为解码器的输入\n",
"            dec_input = torch.cat([bos, Y[:, :-1]], dim=1)\n",
"            Y_hat, _ = net(X, dec_input, X_valid_len)\n",
"            loss = Loss(Y_hat, Y, Y_valid_len)\n",
"            loss.sum().backward()\n",
"            d2l.grad_clipping(net, 1)\n",
"            num_tokens = Y_valid_len.sum()\n",
"            optimizer.step()\n",
"            with torch.no_grad():\n",
"                metrics.add(loss.sum(), num_tokens)\n",
"        if (epoch+1) % 10 == 0:\n",
"            animator.add(epoch+1, (metrics[0]/metrics[1]))\n",
"\n",
"    print(f'loss {metrics[0]/metrics[1]:.3f},{metrics[1]/timer.stop():.1f}'f'token/sec on {str(device)}')"
] },
{ "cell_type": "markdown", "id": "262b3293", "metadata": {}, "source": [ "使用机器翻译数据集测试训练函数的输出" ] },
{ "cell_type": "code", "execution_count": 17, "id": "8d2d5497", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [
"loss 0.019,38254.4token/sec on cuda:0\n"
] },
{ "data": { "image/svg+xml": [ "(matplotlib SVG 矢量图数据已省略:训练损失 loss 随 epoch 下降的曲线,Matplotlib v3.5.1 生成)" ], "text/plain": [ "(训练损失曲线图,SVG 数据已省略)" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [
"embed_size, num_hiddens, num_layers, dropout = 32, 32, 2, 0.1\n",
"batch_size, num_steps = 64, 10\n",
"lr, num_epochs, device = 0.005, 300, d2l.try_gpu()\n",
"\n",
"train_iter, src_vocab, tgt_vocab = d2l.load_data_nmt(batch_size, num_steps)\n",
"encoder = seq2seqEncoder(len(src_vocab), embed_size, num_hiddens, num_layers, dropout)\n",
"decoder = seq2seqDecoder(len(tgt_vocab), embed_size, num_hiddens, num_layers, dropout)\n",
"net = d2l.EncoderDecoder(encoder, decoder)\n",
"train(net, train_iter, lr, num_epochs, tgt_vocab, device)"
] },
{ "cell_type": "code", "execution_count": 19, "id": "49b0d998", "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [
"torch.Size([2, 3])\n",
"torch.Size([1, 2, 3])\n",
"torch.Size([2, 1, 3])\n"
] } ], "source": [
"a1 = torch.tensor([[1,2,3],\n",
"                   [4,5,6]])\n",
"print(a1.shape)\n",
"\n",
"b1 = torch.unsqueeze(a1, dim=0)\n",
"\n",
"print(b1.shape)\n",
"\n",
"c1 = torch.unsqueeze(a1, dim=1)\n",
"\n",
"print(c1.shape)"
] },
{ "cell_type": "markdown", "id": "a698a589", "metadata": {}, "source": [ "预测" ] },
{ "cell_type": "code", "execution_count": 44, "id": "27fcd8d3", "metadata": {}, "outputs": [], "source": [
"def predict(net, src_sentence, src_vocab, tgt_vocab, num_steps, device, save_attention_weights=False):\n",
"    net.eval()\n",
"    src_tokens = src_vocab[src_sentence.lower().split(' ')] + [src_vocab['<eos>']]\n",
"    enc_valid_len = torch.tensor([len(src_tokens)], device=device)\n",
"    src_tokens = d2l.truncate_pad(src_tokens, num_steps, src_vocab['<pad>'])\n",
"    # 添加批量轴,在源词元前面加一个batch的维度\n",
"    enc_X = torch.unsqueeze(torch.tensor(src_tokens, dtype=torch.long, device=device), dim=0)\n",
"    enc_output = net.encoder(enc_X, enc_valid_len)\n",
"    dec_state = net.decoder.init_state(enc_output, enc_valid_len)\n",
"    # 添加批量轴,在开始词元<bos>前面添加一个batch的维度\n",
"    dec_X = torch.unsqueeze(torch.tensor([tgt_vocab['<bos>']], dtype=torch.long, device=device), dim=0)\n",
"\n",
"    output_seq, save_attention_seq = [], []\n",
"    for _ in range(num_steps):\n",
"        Y, dec_state = net.decoder(dec_X, dec_state)\n",
"        # 选取可能性最高的作为下一个时刻的解码器输入\n",
"        dec_X = Y.argmax(dim=2)\n",
"        # 把batch的维度去掉\n",
"        pred = dec_X.squeeze(dim=0).type(torch.int32).item()\n",
"        # 保存注意力权重\n",
"        if save_attention_weights:\n",
"            save_attention_seq.append(net.decoder.attention_weights)\n",
"        # 当检测到预测为结束词元<eos>,结束预测\n",
"        if pred == tgt_vocab['<eos>']:\n",
"            break\n",
"        output_seq.append(pred)\n",
"    return ' '.join(tgt_vocab.to_tokens(output_seq)), save_attention_seq"
] },
{ "cell_type": "markdown", "id": "299f1b55", "metadata": {}, "source": [ "通过BLEU评估预测序列的质量" ] },
label_seq.split(' ')\n", 1046 | " len_pred, len_label = len(pred_tokens), len(label_tokens)\n", 1047 | " score = math.exp(min(0, 1-len_label/len_pred))\n", 1048 | " \n", 1049 | " for n in range(1, k+1):\n", 1050 | " num_matches, label_subs = 0, collections.defaultdict(int)\n", 1051 | " for i in range(len_label - n + 1):\n", 1052 | " label_subs[' '.join(label_tokens[i: i + n])] += 1\n", 1053 | " for i in range(len_pred - n + 1):\n", 1054 | " if label_subs[' '.join(pred_tokens[i: i + n])] > 0:\n", 1055 | " num_matches += 1\n", 1056 | " label_subs[' '.join(pred_tokens[i: i + n])] -= 1\n", 1057 | " score *= math.pow(num_matches / (len_pred - n + 1), math.pow(0.5, n))\n", 1058 | " return score " 1059 | ] 1060 | }, 1061 | { 1062 | "cell_type": "code", 1063 | "execution_count": 45, 1064 | "id": "9f7fc0a9", 1065 | "metadata": {}, 1066 | "outputs": [ 1067 | { 1068 | "name": "stdout", 1069 | "output_type": "stream", 1070 | "text": [ 1071 | "go . => va !, bleu 0.000\n", 1072 | "i lost . => j'ai ., bleu 0.000\n", 1073 | "he's calm . => il est tombé ?, bleu 0.537\n", 1074 | "i'm home . => je suis certain ., bleu 0.512\n" 1075 | ] 1076 | } 1077 | ], 1078 | "source": [ 1079 | "engs = ['go .', \"i lost .\", 'he\\'s calm .', 'i\\'m home .']\n", 1080 | "fras = ['va !', 'j\\'ai perdu .', 'il est calme .', 'je suis chez moi .']\n", 1081 | "for eng, fra in zip(engs, fras):\n", 1082 | " translation, attention_weight_seq = predict(\n", 1083 | " net, eng, src_vocab, tgt_vocab, num_steps, device)\n", 1084 | " print(f'{eng} => {translation}, bleu {BLEU(translation, fra, k=2):.3f}')" 1085 | ] 1086 | }, 1087 | { 1088 | "cell_type": "code", 1089 | "execution_count": null, 1090 | "id": "58f0d057", 1091 | "metadata": {}, 1092 | "outputs": [], 1093 | "source": [] 1094 | } 1095 | ], 1096 | "metadata": { 1097 | "kernelspec": { 1098 | "display_name": "Python [conda env:.conda-torch] *", 1099 | "language": "python", 1100 | "name": "conda-env-.conda-torch-py" 1101 | }, 1102 | "language_info": { 1103 | "codemirror_mode": { 1104 | "name": "ipython", 1105 | "version": 3 1106 | }, 1107 | "file_extension": ".py", 1108 | "mimetype": "text/x-python", 1109 | "name": "python", 1110 | "nbconvert_exporter": "python", 1111 | "pygments_lexer": "ipython3", 1112 | "version": "3.8.16" 1113 | } 1114 | }, 1115 | "nbformat": 4, 1116 | "nbformat_minor": 5 1117 | } 1118 | -------------------------------------------------------------------------------- /李沐DeepLearning/图片/文本预处理/result1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pod2c/Machine_Learning/355a729d156888d5fc8b1653b3e181b638e859e8/李沐DeepLearning/图片/文本预处理/result1.png -------------------------------------------------------------------------------- /李沐DeepLearning/图片/文本预处理/result2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pod2c/Machine_Learning/355a729d156888d5fc8b1653b3e181b638e859e8/李沐DeepLearning/图片/文本预处理/result2.png -------------------------------------------------------------------------------- /李沐DeepLearning/图片/语言模型/fig1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pod2c/Machine_Learning/355a729d156888d5fc8b1653b3e181b638e859e8/李沐DeepLearning/图片/语言模型/fig1.png -------------------------------------------------------------------------------- /李沐DeepLearning/图片/语言模型/fig2.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/pod2c/Machine_Learning/355a729d156888d5fc8b1653b3e181b638e859e8/李沐DeepLearning/图片/语言模型/fig2.png -------------------------------------------------------------------------------- /李沐DeepLearning/图片/语言模型/fig3.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pod2c/Machine_Learning/355a729d156888d5fc8b1653b3e181b638e859e8/李沐DeepLearning/图片/语言模型/fig3.png -------------------------------------------------------------------------------- /李沐DeepLearning/图片/语言模型/fig4.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pod2c/Machine_Learning/355a729d156888d5fc8b1653b3e181b638e859e8/李沐DeepLearning/图片/语言模型/fig4.png -------------------------------------------------------------------------------- /李沐DeepLearning/图片/语言模型/fig5.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/pod2c/Machine_Learning/355a729d156888d5fc8b1653b3e181b638e859e8/李沐DeepLearning/图片/语言模型/fig5.png -------------------------------------------------------------------------------- /李沐DeepLearning/数据处理.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 2, 6 | "id": "0567362c", 7 | "metadata": {}, 8 | "outputs": [], 9 | "source": [ 10 | "import torch" 11 | ] 12 | }, 13 | { 14 | "cell_type": "markdown", 15 | "id": "012b8394", 16 | "metadata": {}, 17 | "source": [ 18 | "张量表示一个数值组成的数组,这个数组可能拥有多个维度" 19 | ] 20 | }, 21 | { 22 | "cell_type": "code", 23 | "execution_count": 3, 24 | "id": "68788921", 25 | "metadata": {}, 26 | "outputs": [ 27 | { 28 | "data": { 29 | "text/plain": [ 30 | "tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])" 31 | ] 32 | }, 33 | "execution_count": 3, 34 | "metadata": {}, 35 | "output_type": "execute_result" 36 | } 37 | ], 38 | "source": [ 39 | "x=torch.arange(10)\n", 40 | "x" 41 | ] 42 | }, 43 | { 44 | "cell_type": "markdown", 45 | "id": "c916d49f", 46 | "metadata": {}, 47 | "source": [ 48 | " 张量的shape属性能够访问张量的形状和张量内的元素总数" 49 | ] 50 | }, 51 | { 52 | "cell_type": "code", 53 | "execution_count": 4, 54 | "id": "1f30a700", 55 | "metadata": {}, 56 | "outputs": [ 57 | { 58 | "data": { 59 | "text/plain": [ 60 | "torch.Size([10])" 61 | ] 62 | }, 63 | "execution_count": 4, 64 | "metadata": {}, 65 | "output_type": "execute_result" 66 | } 67 | ], 68 | "source": [ 69 | "x.shape" 70 | ] 71 | }, 72 | { 73 | "cell_type": "code", 74 | "execution_count": 5, 75 | "id": "ee144095", 76 | "metadata": {}, 77 | "outputs": [ 78 | { 79 | "data": { 80 | "text/plain": [ 81 | "10" 82 | ] 83 | }, 84 | "execution_count": 5, 85 | "metadata": {}, 86 | "output_type": "execute_result" 87 | } 88 | ], 89 | "source": [ 90 | "x.numel() ##表示张量内的元素数量,是一个标量" 91 | ] 92 | }, 93 | { 94 | "cell_type": "markdown", 95 | "id": "4a66e5f8", 96 | "metadata": {}, 97 | "source": [ 98 | "要改变一个张量的形状而不改变张量内的元素的数量和数值需要调用reshape函数" 99 | ] 100 | }, 101 | { 102 | "cell_type": "code", 103 | "execution_count": 6, 104 | "id": "80bdc2b0", 105 | "metadata": {}, 106 | "outputs": [ 107 | { 108 | "data": { 109 | "text/plain": [ 110 | "tensor([[0, 1],\n", 111 | " [2, 3],\n", 112 | " [4, 5],\n", 113 | " [6, 7],\n", 114 | " [8, 9]])" 115 | ] 116 | }, 117 | "execution_count": 6, 118 | "metadata": {}, 119 | "output_type": "execute_result" 120 | } 121 | ], 122 | "source": [ 123 | "x.reshape(5,2) ##将张量改为5行2列" 124 | ] 125 | }, 126 | { 127 | "cell_type": 
"markdown", 128 | "id": "5eae3ff6", 129 | "metadata": {}, 130 | "source": [ 131 | "可以生成一些全是1、全是0或是从特定分布中随机采样的张量" 132 | ] 133 | }, 134 | { 135 | "cell_type": "code", 136 | "execution_count": 7, 137 | "id": "1dab6a82", 138 | "metadata": {}, 139 | "outputs": [ 140 | { 141 | "data": { 142 | "text/plain": [ 143 | "tensor([[[0., 0., 0., 0., 0.],\n", 144 | " [0., 0., 0., 0., 0.],\n", 145 | " [0., 0., 0., 0., 0.],\n", 146 | " [0., 0., 0., 0., 0.]],\n", 147 | "\n", 148 | " [[0., 0., 0., 0., 0.],\n", 149 | " [0., 0., 0., 0., 0.],\n", 150 | " [0., 0., 0., 0., 0.],\n", 151 | " [0., 0., 0., 0., 0.]],\n", 152 | "\n", 153 | " [[0., 0., 0., 0., 0.],\n", 154 | " [0., 0., 0., 0., 0.],\n", 155 | " [0., 0., 0., 0., 0.],\n", 156 | " [0., 0., 0., 0., 0.]]])" 157 | ] 158 | }, 159 | "execution_count": 7, 160 | "metadata": {}, 161 | "output_type": "execute_result" 162 | } 163 | ], 164 | "source": [ 165 | "torch.zeros((3,4,5))" 166 | ] 167 | }, 168 | { 169 | "cell_type": "code", 170 | "execution_count": 8, 171 | "id": "ec04ec80", 172 | "metadata": {}, 173 | "outputs": [ 174 | { 175 | "data": { 176 | "text/plain": [ 177 | "tensor([[[1., 1., 1., 1., 1.],\n", 178 | " [1., 1., 1., 1., 1.],\n", 179 | " [1., 1., 1., 1., 1.],\n", 180 | " [1., 1., 1., 1., 1.]],\n", 181 | "\n", 182 | " [[1., 1., 1., 1., 1.],\n", 183 | " [1., 1., 1., 1., 1.],\n", 184 | " [1., 1., 1., 1., 1.],\n", 185 | " [1., 1., 1., 1., 1.]],\n", 186 | "\n", 187 | " [[1., 1., 1., 1., 1.],\n", 188 | " [1., 1., 1., 1., 1.],\n", 189 | " [1., 1., 1., 1., 1.],\n", 190 | " [1., 1., 1., 1., 1.]]])" 191 | ] 192 | }, 193 | "execution_count": 8, 194 | "metadata": {}, 195 | "output_type": "execute_result" 196 | } 197 | ], 198 | "source": [ 199 | "torch.ones((3,4,5))" 200 | ] 201 | }, 202 | { 203 | "cell_type": "markdown", 204 | "id": "554cecd1", 205 | "metadata": {}, 206 | "source": [ 207 | "常见的标准运算都能升级成按元素计算" 208 | ] 209 | }, 210 | { 211 | "cell_type": "code", 212 | "execution_count": 9, 213 | "id": "91f8c585", 214 | "metadata": {}, 215 | "outputs": [ 216 | { 217 | "data": { 218 | "text/plain": [ 219 | "(tensor([ 3, 4, 6, 10]),\n", 220 | " tensor([-1, 0, 2, 6]),\n", 221 | " tensor([ 2, 4, 8, 16]),\n", 222 | " tensor([0.5000, 1.0000, 2.0000, 4.0000]),\n", 223 | " tensor([ 1, 4, 16, 64]))" 224 | ] 225 | }, 226 | "execution_count": 9, 227 | "metadata": {}, 228 | "output_type": "execute_result" 229 | } 230 | ], 231 | "source": [ 232 | "x = torch.tensor([1,2,4,8])\n", 233 | "y = torch.tensor([2,2,2,2])\n", 234 | "x+y,x-y,x*y,x/y,x**y" 235 | ] 236 | }, 237 | { 238 | "cell_type": "code", 239 | "execution_count": 10, 240 | "id": "830b6367", 241 | "metadata": {}, 242 | "outputs": [ 243 | { 244 | "data": { 245 | "text/plain": [ 246 | "tensor([2.7183e+00, 7.3891e+00, 5.4598e+01, 2.9810e+03])" 247 | ] 248 | }, 249 | "execution_count": 10, 250 | "metadata": {}, 251 | "output_type": "execute_result" 252 | } 253 | ], 254 | "source": [ 255 | "##指数计算\n", 256 | "torch.exp(x)" 257 | ] 258 | }, 259 | { 260 | "cell_type": "markdown", 261 | "id": "3b071b5c", 262 | "metadata": {}, 263 | "source": [ 264 | "连接多个张量" 265 | ] 266 | }, 267 | { 268 | "cell_type": "code", 269 | "execution_count": 11, 270 | "id": "9972dcff", 271 | "metadata": {}, 272 | "outputs": [ 273 | { 274 | "data": { 275 | "text/plain": [ 276 | "(tensor([[ 0., 1., 2., 3.],\n", 277 | " [ 4., 5., 6., 7.],\n", 278 | " [ 8., 9., 10., 11.],\n", 279 | " [ 2., 1., 4., 3.],\n", 280 | " [ 1., 2., 3., 4.],\n", 281 | " [ 4., 1., 2., 3.]]),\n", 282 | " tensor([[ 0., 1., 2., 3., 2., 1., 4., 3.],\n", 283 | " [ 4., 5., 6., 7., 1., 2., 3., 
4.],\n",
"         [ 8.,  9., 10., 11.,  4.,  1.,  2.,  3.]]))"
] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [
"X = torch.arange(12, dtype=torch.float32).reshape(3,4)\n",
"Y = torch.tensor([[2,1,4,3],[1,2,3,4],[4,1,2,3]])\n",
"torch.cat((X,Y), dim=0), torch.cat((X,Y), dim=1)"
] },
{ "cell_type": "markdown", "id": "ac32f85d", "metadata": {}, "source": [ "使用逻辑运算符构建二元张量" ] },
{ "cell_type": "code", "execution_count": 12, "id": "5245e6ea", "metadata": {}, "outputs": [ { "data": { "text/plain": [
"tensor([[False,  True, False,  True],\n",
"        [False, False, False, False],\n",
"        [False, False, False, False]])"
] }, "execution_count": 12, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X == Y" ] },
{ "cell_type": "markdown", "id": "7f6d5356", "metadata": {}, "source": [ "对一个张量中的所有元素求和会产生只有一个元素的张量" ] },
{ "cell_type": "code", "execution_count": 13, "id": "1ab11c93", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "tensor(66.)" ] }, "execution_count": 13, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X.sum()" ] },
{ "cell_type": "markdown", "id": "418ddf70", "metadata": {}, "source": [ "面对形状不同的张量运算,可以使用广播机制来执行按元素操作:要求两个张量在每个维度上要么长度相等,要么其中一个为1(长度为1的维度会被复制扩展),结果张量的shape在每个维度上取两者的最大值。" ] },
{ "cell_type": "code", "execution_count": 14, "id": "844d747c", "metadata": {}, "outputs": [ { "data": { "text/plain": [
"(tensor([[0],\n",
"         [1],\n",
"         [2]]),\n",
" tensor([[0, 1]]))"
] }, "execution_count": 14, "metadata": {}, "output_type": "execute_result" } ], "source": [
"a = torch.arange(3).reshape(3,1)\n",
"b = torch.arange(2).reshape(1,2)\n",
"a,b"
] },
{ "cell_type": "code", "execution_count": 18, "id": "541c1321", "metadata": {}, "outputs": [ { "data": { "text/plain": [
"tensor([[0, 1],\n",
"        [1, 2],\n",
"        [2, 3]])"
] }, "execution_count": 18, "metadata": {}, "output_type": "execute_result" } ], "source": [
"c = a + b\n",
"c"
] },
{ "cell_type": "code", "execution_count": 17, "id": "9f8af5c9", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "torch.Size([3, 2])" ] }, "execution_count": 17, "metadata": {}, "output_type": "execute_result" } ], "source": [ "c.shape" ] },
{ "cell_type": "markdown", "id": "3cf73063", "metadata": {}, "source": [ "元素的访问:用[-1]访问最后一个元素,用[1:3]访问第二个和第三个元素(切片区间左闭右开,对应索引1和2)" ] },
{ "cell_type": "code", "execution_count": 19, "id": "059fb8ca", "metadata": {}, "outputs": [ { "data": { "text/plain": [
"(tensor([ 8.,  9., 10., 11.]),\n",
" tensor([[ 4.,  5.,  6.,  7.],\n",
"         [ 8.,  9., 10., 11.]]))"
] }, "execution_count": 19, "metadata": {}, "output_type": "execute_result" } ], "source": [ "X[-1],X[1:3]" ] },
{ "cell_type": "markdown", "id": "fe4bde30", "metadata": {}, "source": [ "通过索引改变张量内的元素数值" ] },
{ "cell_type": "code", "execution_count": 22, "id": "6fb5820b", "metadata": {}, "outputs": [ { "data": { "text/plain": [
"tensor([[ 0.,  1.,  2.,  3.],\n",
"        [ 4.,  5.,  7.,  7.],\n",
"        [ 8.,  9., 10., 11.]])"
] }, "execution_count": 22, "metadata": {}, "output_type": "execute_result" } ], "source": [
"X[1,2]=7\n",
"X"
] },
{ "cell_type": "markdown", "id": "4a2684db", "metadata": {}, "source": [ "为多个元素赋值,只需要索引所有元素,然后赋值" ] },
{ "cell_type": "code", "execution_count": 23, "id": "e565db0e", "metadata": {}, "outputs": [ { "data": { "text/plain": [
"tensor([[12., 12., 12., 12.],\n",
"        [12., 12., 12., 12.],\n",
"        [ 8.,  9., 10., 11.]])"
] }, "execution_count": 23, "metadata": {}, "output_type": "execute_result" } ], "source": [
"X[0:2,:]=12\n",
"X"
] },
{ "cell_type": "markdown", "id": "64bc3066", "metadata": {}, "source": [ "pytorch张量转为NumPy张量" ] },
{ "cell_type": "code", "execution_count": 26, "id": "80c6af70", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(numpy.ndarray, torch.Tensor)" ] }, "execution_count": 26, "metadata": {}, "output_type": "execute_result" } ], "source": [
"A=X.numpy()\n",
"B=torch.tensor(A)\n",
"A,B\n",
"type(A),type(B)"
] },
{ "cell_type": "markdown", "id": "73f9fb59", "metadata": {}, "source": [ "将张量转为python标量" ] },
{ "cell_type": "code", "execution_count": 28, "id": "06c2c8eb", "metadata": {}, "outputs": [ { "data": { "text/plain": [ "(tensor(3.5000), 3.5, 3.5, 3)" ] }, "execution_count": 28, "metadata": {}, "output_type": "execute_result" } ], "source": [
"C = torch.tensor(3.5)\n",
"C,C.item(),float(C),int(C)"
] },
{ "cell_type": "markdown", "id": "3ab68bfc", "metadata": {}, "source": [ "创建一个csv数据集" ] },
{ "cell_type": "code", "execution_count": 34, "id": "99a3d717", "metadata": {}, "outputs": [], "source": [
"import os\n",
"\n",
"os.makedirs(os.path.join('..','data'),exist_ok=True)\n",
"data_file = os.path.join('..','data','house.csv')\n",
"\n",
"with open(data_file,'w') as f:\n",
"    f.write('NumRooms,Name,Prince\\n')\n",
"    f.write('NA,Alice,256000\\n')\n",
"    f.write('3,NA,NA\\n')\n",
"    f.write('5,Dave,362110\\n')\n",
"    f.write('NA,Alex,NA\\n')"
] },
"data": { 638 | "text/html": [ 639 | "
\n", 640 | "\n", 653 | "\n", 654 | " \n", 655 | " \n", 656 | " \n", 657 | " \n", 658 | " \n", 659 | " \n", 660 | " \n", 661 | " \n", 662 | " \n", 663 | " \n", 664 | " \n", 665 | " \n", 666 | " \n", 667 | " \n", 668 | " \n", 669 | " \n", 670 | " \n", 671 | " \n", 672 | " \n", 673 | " \n", 674 | " \n", 675 | " \n", 676 | " \n", 677 | " \n", 678 | " \n", 679 | " \n", 680 | " \n", 681 | " \n", 682 | " \n", 683 | " \n", 684 | " \n", 685 | " \n", 686 | " \n", 687 | " \n", 688 | "
NumRoomsNamePrince
0NaNAlice256000.0
13.0NaNNaN
25.0Dave362110.0
3NaNAlexNaN
\n", 689 | "
" 690 | ], 691 | "text/plain": [ 692 | " NumRooms Name Prince\n", 693 | "0 NaN Alice 256000.0\n", 694 | "1 3.0 NaN NaN\n", 695 | "2 5.0 Dave 362110.0\n", 696 | "3 NaN Alex NaN" 697 | ] 698 | }, 699 | "execution_count": 37, 700 | "metadata": {}, 701 | "output_type": "execute_result" 702 | } 703 | ], 704 | "source": [ 705 | "import pandas as pd\n", 706 | "\n", 707 | "data = pd.read_csv(data_file)\n", 708 | "data" 709 | ] 710 | }, 711 | { 712 | "cell_type": "markdown", 713 | "id": "67adfc62", 714 | "metadata": {}, 715 | "source": [ 716 | "通过插值和删除来处理缺失值" 717 | ] 718 | }, 719 | { 720 | "cell_type": "code", 721 | "execution_count": 39, 722 | "id": "50c1c64a", 723 | "metadata": {}, 724 | "outputs": [ 725 | { 726 | "name": "stderr", 727 | "output_type": "stream", 728 | "text": [ 729 | "C:\\Users\\pod2g\\AppData\\Local\\Temp\\ipykernel_7636\\3505320478.py:2: FutureWarning: The default value of numeric_only in DataFrame.mean is deprecated. In a future version, it will default to False. In addition, specifying 'numeric_only=None' is deprecated. Select only valid columns or specify the value of numeric_only to silence this warning.\n", 730 | " inputs = inputs.fillna(inputs.mean())\n" 731 | ] 732 | }, 733 | { 734 | "data": { 735 | "text/html": [ 736 | "
\n", 737 | "\n", 750 | "\n", 751 | " \n", 752 | " \n", 753 | " \n", 754 | " \n", 755 | " \n", 756 | " \n", 757 | " \n", 758 | " \n", 759 | " \n", 760 | " \n", 761 | " \n", 762 | " \n", 763 | " \n", 764 | " \n", 765 | " \n", 766 | " \n", 767 | " \n", 768 | " \n", 769 | " \n", 770 | " \n", 771 | " \n", 772 | " \n", 773 | " \n", 774 | " \n", 775 | " \n", 776 | " \n", 777 | " \n", 778 | " \n", 779 | " \n", 780 | "
NumRoomsName
04.0Alice
13.0NaN
25.0Dave
34.0Alex
\n", 781 | "
" 782 | ], 783 | "text/plain": [ 784 | " NumRooms Name\n", 785 | "0 4.0 Alice\n", 786 | "1 3.0 NaN\n", 787 | "2 5.0 Dave\n", 788 | "3 4.0 Alex" 789 | ] 790 | }, 791 | "execution_count": 39, 792 | "metadata": {}, 793 | "output_type": "execute_result" 794 | } 795 | ], 796 | "source": [ 797 | "inputs, outputs = data.iloc[:,0:2], data.iloc[:,2]\n", 798 | "inputs = inputs.fillna(inputs.mean())\n", 799 | "inputs" 800 | ] 801 | }, 802 | { 803 | "cell_type": "markdown", 804 | "id": "2a51984c", 805 | "metadata": {}, 806 | "source": [ 807 | "对于数据集中的离散值或是类别值,可以将Na视为一个类别" 808 | ] 809 | }, 810 | { 811 | "cell_type": "code", 812 | "execution_count": 40, 813 | "id": "40ea93c9", 814 | "metadata": {}, 815 | "outputs": [ 816 | { 817 | "data": { 818 | "text/html": [ 819 | "
\n", 820 | "\n", 833 | "\n", 834 | " \n", 835 | " \n", 836 | " \n", 837 | " \n", 838 | " \n", 839 | " \n", 840 | " \n", 841 | " \n", 842 | " \n", 843 | " \n", 844 | " \n", 845 | " \n", 846 | " \n", 847 | " \n", 848 | " \n", 849 | " \n", 850 | " \n", 851 | " \n", 852 | " \n", 853 | " \n", 854 | " \n", 855 | " \n", 856 | " \n", 857 | " \n", 858 | " \n", 859 | " \n", 860 | " \n", 861 | " \n", 862 | " \n", 863 | " \n", 864 | " \n", 865 | " \n", 866 | " \n", 867 | " \n", 868 | " \n", 869 | " \n", 870 | " \n", 871 | " \n", 872 | " \n", 873 | " \n", 874 | " \n", 875 | " \n", 876 | " \n", 877 | " \n", 878 | "
NumRoomsName_AlexName_AliceName_DaveName_nan
04.00100
13.00001
25.00010
34.01000
\n", 879 | "
" 880 | ], 881 | "text/plain": [ 882 | " NumRooms Name_Alex Name_Alice Name_Dave Name_nan\n", 883 | "0 4.0 0 1 0 0\n", 884 | "1 3.0 0 0 0 1\n", 885 | "2 5.0 0 0 1 0\n", 886 | "3 4.0 1 0 0 0" 887 | ] 888 | }, 889 | "execution_count": 40, 890 | "metadata": {}, 891 | "output_type": "execute_result" 892 | } 893 | ], 894 | "source": [ 895 | "inputs = pd.get_dummies(inputs, dummy_na=True)\n", 896 | "inputs" 897 | ] 898 | }, 899 | { 900 | "cell_type": "markdown", 901 | "id": "35b6f411", 902 | "metadata": {}, 903 | "source": [ 904 | "在将数据集中的数据都转化为数值后,可以将这些数值转化为tensor张量" 905 | ] 906 | }, 907 | { 908 | "cell_type": "code", 909 | "execution_count": 41, 910 | "id": "ebda561b", 911 | "metadata": {}, 912 | "outputs": [ 913 | { 914 | "data": { 915 | "text/plain": [ 916 | "(tensor([[4., 0., 1., 0., 0.],\n", 917 | " [3., 0., 0., 0., 1.],\n", 918 | " [5., 0., 0., 1., 0.],\n", 919 | " [4., 1., 0., 0., 0.]], dtype=torch.float64),\n", 920 | " tensor([256000., nan, 362110., nan], dtype=torch.float64))" 921 | ] 922 | }, 923 | "execution_count": 41, 924 | "metadata": {}, 925 | "output_type": "execute_result" 926 | } 927 | ], 928 | "source": [ 929 | "X, y = torch.tensor(inputs.values),torch.tensor(outputs.values)\n", 930 | "X, y" 931 | ] 932 | }, 933 | { 934 | "cell_type": "code", 935 | "execution_count": null, 936 | "id": "1861ef0b", 937 | "metadata": {}, 938 | "outputs": [], 939 | "source": [] 940 | } 941 | ], 942 | "metadata": { 943 | "kernelspec": { 944 | "display_name": "Python [conda env:torch] *", 945 | "language": "python", 946 | "name": "conda-env-torch-py" 947 | }, 948 | "language_info": { 949 | "codemirror_mode": { 950 | "name": "ipython", 951 | "version": 3 952 | }, 953 | "file_extension": ".py", 954 | "mimetype": "text/x-python", 955 | "name": "python", 956 | "nbconvert_exporter": "python", 957 | "pygments_lexer": "ipython3", 958 | "version": "3.8.16" 959 | } 960 | }, 961 | "nbformat": 4, 962 | "nbformat_minor": 5 963 | } 964 | -------------------------------------------------------------------------------- /李沐DeepLearning/文本预处理/README.md: -------------------------------------------------------------------------------- 1 | 最近在B站上跟着李沐老师学NLP,在这里把文本预处理的代码做一个小总结。 2 | 3 | ### 一. 导入文本 4 | ```python 5 | d2l.DATA_HUB['time_machine'] = (d2l.DATA_URL + 'timemachine.txt','090b5e7e70c295757f55df93cb0a180b9691891a') 6 | 7 | def read_book(): 8 | with open(d2l.download('time_machine'), 'r') as f: 9 | lines = f.readlines() 10 | return [re.sub('[^A-Za-z]+', ' ', line).strip().lower() for line in lines] 11 | 12 | lines = read_book() 13 | print(lines[0]) 14 | print(lines[10]) 15 | ``` 16 | 17 | 这里使用了Dive to Learning提供的d2l库,方便导入文本数据。 18 | 19 | 在导入文本数据后,构造一个函数来读取数据中的文本行,此处为了简化数据集,将文本中除了英文字母以外的符号全部变成空格,并且将大写字母转为小写字母。 20 | 21 | ### 二. 词元化 22 | 23 | 词元化是将一个个文本行(lines)作为输入,将文本行中的词汇拆开来变成一个个词元。词元是文本的基本单位。 24 | ```python 25 | def tokenize(lines, token='word'): 26 | if (token == 'word'): 27 | return [line.split() for line in lines] 28 | elif (token == 'char'): 29 | return [list(line) for line in lines] 30 | else: 31 | print ('Error Token Type:' + token) 32 | 33 | tokens = tokenize(lines) 34 | for i in range(22): 35 | print(tokens[i]) 36 | ``` 37 | 这里构造一个tokenize函数,其输入为一个包含若干个文本行数据的列表以及一个token用作分辨词元类型。在此函数中,若token为单词(word),则使用split函数将文本行中的单词逐个拆分,然后返回一个包含若干个单词的列表;若token为字符(char),则使用list函数将文本行中的字母逐个拆分,然后返回包含若干个字母的列表;最后若输入的token无法识别则返回Error。 38 | 39 | ### 三. 
### 三. 构建词汇表

词元的数据类型为字符串,而深度学习模型要求的输入为数字,单纯用词元不符合模型的输入要求,需要将词元映射到从0开始的数字索引当中。首先将所有的文本数据合并,接着对每个唯一词元统计出现频率,统计结果被称为语料库(corpus),然后按照出现频率为每个唯一词元分配一个数字索引。很少出现的词元将被删除以降低复杂性。不存在于语料库中的词元以及已被删除的词元,都将被映射到同一个未知词元`<unk>`中。通常地,还可以人为地增加一个列表,用于保存那些被保留的词元,例如序列开始词元表示一个句子的开始,序列结束词元表示一个句子的结束。
```python
class Vocab:
    def __init__(self, tokens=None, mini_freq=0, reserved_token=None):
        """文本词汇表"""
        if(tokens is None):
            tokens = [ ]
        if(reserved_token is None):
            reserved_token = [ ]
        counter = corpus_counter(tokens) #计算词元频率构造语料库
        self.token_freq = sorted(counter.items(), key=lambda x:x[1], reverse=True) #将词元按照出现频率从高到低排列

        self.unk, uniq_tokens = 0, ['<unk>'] + reserved_token #构造一个存放唯一词元的列表,0号索引固定留给未知词元<unk>
        #对于语料库中出现频率满足设定的最小频率、且尚未放入列表的词元,逐个将其放入列表中。
        uniq_tokens += [token for token, freq in self.token_freq if freq >= mini_freq and token not in uniq_tokens]
        self.token_to_idx = dict() #给定词元返回数字索引
        self.idx_to_token = [ ] #给定数字索引返回词元
        #将数字索引和列表中的词元一一对应
        for token in uniq_tokens:
            self.idx_to_token.append(token)
            self.token_to_idx[token] = len(self.idx_to_token) - 1

    def __len__(self):
        """返回储存词元列表的长度"""
        return len(self.idx_to_token)

    def __getitem__(self, tokens):
        """输入一个词元,返回一个数字索引"""
        if not isinstance(tokens, (list, tuple)):
            return self.token_to_idx.get(tokens, self.unk)
        return [self.__getitem__(token) for token in tokens]

    def to_token(self, indices):
        """输入一个数字索引,返回一个词元"""
        if not isinstance(indices, (list, tuple)):
            return self.idx_to_token[indices]
        return [self.to_token(idx) for idx in indices]

def corpus_counter(tokens):
    """统计词频"""
    if (len(tokens)==0 or isinstance(tokens[0], list)):
        #先将二维的词元列表展平成一维,再统计每个词元的出现频率
        tokens = [token for line in tokens for token in line]
    return collections.Counter(tokens)
```
在这部分,需要构造一个Vocab类用来处理词元和索引之间的关系,以及一个corpus_counter函数来统计词元的出现频率以构造语料库。首先,构造corpus_counter函数来建立语料库:对于文本数据当中的每一个词元,使用collections中的Counter()进行统计,最后返回一个词元到出现频率的计数器。

接着构造Vocab类,此类中包含三类函数。第一类函数__init__()用来定义和初始化变量:

1. 输入的变量应该是一个由多个词元组成的tokens列表,并且设置一个mini_freq作为词元出现的最小频率(用于过滤出现频率太低的词元),以及一个用来储存保留词元的reserved_token。
2. 先使用corpus_counter函数构造一个语料库来储存词元出现频率,接着对词元按出现频率从高到低排序。
3. 然后声明一个unk变量,初始值为0,作为未知词元(不在语料库中的词元和已删除的词元)对应的索引。之后声明一个列表uniq_tokens来储存所有唯一词元(包括未知词元`<unk>`和保留词元reserved_token)。
4. 接着,对于语料库中出现频率满足设定最小频率、且尚未放入列表的词元,逐个将其放入uniq_tokens中。
5. 下一步声明两个变量用于词元和数字索引之间的转化。
6. 最后一步,把词元逐个放入idx_to_token中,用于给定数字索引时返回对应词元;同时将词元对应的数字索引放入token_to_idx中,用于给定词元时返回数字索引。

第二类函数__len__()用来返回储存词元列表的长度。

第三类函数包含两个函数__getitem__()和to_token():__getitem__()给定一个词元返回对应的数字索引;to_token()给定一个数字索引返回对应的词元(注意idx_to_token是列表,直接按下标取值即可)。

### 四. 加入真实数据集

这一步,使用之前导入的时光机器数据来构造词汇表,并且打印部分高频词元。
```python
vocab = Vocab(tokens)
print(list(vocab.token_to_idx.items())[:10])
```
运行结果:

(运行结果截图见仓库中的 图片/文本预处理/result1.png)

### 五. 将文本行转化为数字索引列表
```python
for i in [0, 10]:
    print('word:',tokens[i])
    print('index:',vocab[tokens[i]])
```

运行结果:

(运行结果截图见仓库中的 图片/文本预处理/result2.png)

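为了直观理解Vocab的用法,下面补充一个极简的往返示例(toy_tokens是虚构的演示数据,与时光机器数据集无关):
```python
toy_tokens = [['the', 'time', 'machine'], ['the', 'time']]
toy_vocab = Vocab(toy_tokens)
print(toy_vocab['the'])            # 1:出现频率最高的词元索引最小,0号固定是<unk>
print(toy_vocab.to_token([1, 2]))  # ['the', 'time']:to_token与__getitem__互为逆操作
print(toy_vocab['robot'])          # 0:不在语料库中的词元映射为<unk>
```
### 六. 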
整合所有功能 125 | 现在将之前的所有功能合并到一个函数load_corpus_time_machine()当中,此函数最终返回一个词元的索引列表corpus和一个词汇表vocabu。 126 | ```python 127 | def load_corpus_time_machine(max_token=-1): 128 | lines = read_book() #导入文本数据 129 | tokens = tokenize(lines, 'char') #拆分文本数据转为词元 130 | vocabu = Vocab(tokens) #构造词汇表 131 | corpus = [vocabu[token] for line in tokens for token in line] #得到词元索引列表 132 | 133 | if (max_token > 0): 134 | corpus = corpus[:max_token] #按设置好的数量提取需要用来训练的词元 135 | return vocabu, corpus #返回词汇表以及数字索引列表 136 | 137 | vocabu, corpus = load_corpus_time_machine() 138 | len(vocabu), len(corpus) 139 | ``` 140 | 需要注意的是: 141 | 142 | 1. 为了简化训练,这里使用字符(而不是单词)实现文本词元化; 143 | 2. 时光机器数据集中的每个文本行不一定是一个句子或一个段落,还可能是一个单词,因此返回的corpus仅处理为单个列表,而不是使用多词元列表构成的一个列表。 144 | -------------------------------------------------------------------------------- /李沐DeepLearning/文本预处理/文本预处理.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "id": "75971f1d", 7 | "metadata": {}, 8 | "outputs": [], 9 | "source": [ 10 | "from d2l import torch as d2l\n", 11 | "import collections\n", 12 | "import re" 13 | ] 14 | }, 15 | { 16 | "cell_type": "markdown", 17 | "id": "ee3e072f", 18 | "metadata": {}, 19 | "source": [ 20 | "导入一本书的数据集并且转化为一系列的文本" 21 | ] 22 | }, 23 | { 24 | "cell_type": "code", 25 | "execution_count": 2, 26 | "id": "d2026236", 27 | "metadata": {}, 28 | "outputs": [ 29 | { 30 | "name": "stdout", 31 | "output_type": "stream", 32 | "text": [ 33 | "the time machine by h g wells\n", 34 | "twinkled and his usually pale face was flushed and animated the\n" 35 | ] 36 | } 37 | ], 38 | "source": [ 39 | "d2l.DATA_HUB['time_machine'] = (d2l.DATA_URL + 'timemachine.txt','090b5e7e70c295757f55df93cb0a180b9691891a')\n", 40 | "\n", 41 | "def read_book():\n", 42 | " with open(d2l.download('time_machine'), 'r') as f:\n", 43 | " lines = f.readlines()\n", 44 | " return [re.sub('[^A-Za-z]+', ' ', line).strip().lower() for line in lines]\n", 45 | "\n", 46 | "lines = read_book()\n", 47 | "print(lines[0])\n", 48 | "print(lines[10])" 49 | ] 50 | }, 51 | { 52 | "cell_type": "markdown", 53 | "id": "98137cd1", 54 | "metadata": {}, 55 | "source": [ 56 | "词元化:tokenize函数将文本行列表(lines)作为输入,此列表中的元素为一个个文本序列,tokenize函数将每个文本序列拆开成为一个个词元(token),词元是文本的基本单位,最后函数会返回一个由词元构成的列表(list)。" 57 | ] 58 | }, 59 | { 60 | "cell_type": "code", 61 | "execution_count": 3, 62 | "id": "5ab85a7a", 63 | "metadata": {}, 64 | "outputs": [ 65 | { 66 | "name": "stdout", 67 | "output_type": "stream", 68 | "text": [ 69 | "['the', 'time', 'machine', 'by', 'h', 'g', 'wells']\n", 70 | "[]\n", 71 | "[]\n", 72 | "[]\n", 73 | "[]\n", 74 | "['i']\n", 75 | "[]\n", 76 | "[]\n", 77 | "['the', 'time', 'traveller', 'for', 'so', 'it', 'will', 'be', 'convenient', 'to', 'speak', 'of', 'him']\n", 78 | "['was', 'expounding', 'a', 'recondite', 'matter', 'to', 'us', 'his', 'grey', 'eyes', 'shone', 'and']\n", 79 | "['twinkled', 'and', 'his', 'usually', 'pale', 'face', 'was', 'flushed', 'and', 'animated', 'the']\n", 80 | "['fire', 'burned', 'brightly', 'and', 'the', 'soft', 'radiance', 'of', 'the', 'incandescent']\n", 81 | "['lights', 'in', 'the', 'lilies', 'of', 'silver', 'caught', 'the', 'bubbles', 'that', 'flashed', 'and']\n", 82 | "['passed', 'in', 'our', 'glasses', 'our', 'chairs', 'being', 'his', 'patents', 'embraced', 'and']\n", 83 | "['caressed', 'us', 'rather', 'than', 'submitted', 'to', 'be', 'sat', 'upon', 'and', 'there', 'was', 'that']\n", 84 | "['luxurious', 'after', 'dinner', 'atmosphere', 'when', 
85 |       "['free', 'of', 'the', 'trammels', 'of', 'precision', 'and', 'he', 'put', 'it', 'to', 'us', 'in', 'this']\n",
86 |       "['way', 'marking', 'the', 'points', 'with', 'a', 'lean', 'forefinger', 'as', 'we', 'sat', 'and', 'lazily']\n",
87 |       "['admired', 'his', 'earnestness', 'over', 'this', 'new', 'paradox', 'as', 'we', 'thought', 'it']\n",
88 |       "['and', 'his', 'fecundity']\n",
89 |       "[]\n",
90 |       "['you', 'must', 'follow', 'me', 'carefully', 'i', 'shall', 'have', 'to', 'controvert', 'one', 'or', 'two']\n"
91 |      ]
92 |     }
93 |    ],
94 |    "source": [
95 |     "def tokenize(lines, token='word'):\n",
96 |     "    if (token == 'word'):\n",
97 |     "        return [line.split() for line in lines]\n",
98 |     "    elif (token == 'char'):\n",
99 |     "        return [list(line) for line in lines]\n",
100 |     "    else:\n",
101 |     "        print ('Error Token Type:' + token)\n",
102 |     "\n",
103 |     "tokens = tokenize(lines)\n",
104 |     "for i in range(22):\n",
105 |     "    print(tokens[i])"
106 |    ]
107 |   },
108 |   {
109 |    "cell_type": "markdown",
110 |    "id": "f501572a",
111 |    "metadata": {},
112 |    "source": [
113 |     "构建词汇表类:词元的类型为字符串,而模型需要的输入为数字,因此单纯的词元并不适合输入模型进行训练,需要将词元映射到从0开始的数字索引当中。首先需要先将所有文本合并到一起,接着对每个唯一的词元的出现频率进行统计,统计结果被称为语料库(corpus),然后为每个唯一词元的出现频率分配一个数字索引。很少出现的词元将被删除以降低复杂性。并且对于不存在语料库中的词元或者已经删除的词元都将被映射到一个未知词元中。通常地,可以人为地增加一个列表,用于保存那些被保留的词元,例如序列开始词元表示一个句子的开始,序列结束词元表示一个句子的结束。"
114 |    ]
115 |   },
116 |   {
117 |    "cell_type": "code",
118 |    "execution_count": 18,
119 |    "id": "ab738886",
120 |    "metadata": {},
121 |    "outputs": [],
122 |    "source": [
123 |     "class Vocab:\n",
124 |     "    def __init__(self, tokens=None, mini_freq=0, reserved_token=None):\n",
125 |     "        \"\"\"文本词汇表\"\"\"\n",
126 |     "        if tokens is None:\n",
127 |     "            tokens = []\n",
128 |     "        if reserved_token is None:\n",
129 |     "            reserved_token = []\n",
130 |     "        counter = corpus_counter(tokens) #计算词元频率构造语料库\n",
131 |     "        self.token_freq = sorted(counter.items(), key=lambda x:x[1], reverse=True) #将词元按照出现频率从高到低排列\n",
132 |     "        \n",
133 |     "        self.unk, uniq_tokens = 0, ['<unk>'] + reserved_token #索引0保留给未知词元,列表中先放入未知词元和保留词元\n",
134 |     "        #对于语料库中出现频率满足设定的最小频率的词元以及不在列表中的词元,逐个将这些满足条件的词元放入列表中。\n",
135 |     "        uniq_tokens += [token for token, freq in self.token_freq if freq >= mini_freq and token not in uniq_tokens] \n",
136 |     "        self.token_to_idx = dict() #给定词元返回数字索引\n",
137 |     "        self.idx_to_token = [] #给定数字索引返回词元\n",
138 |     "        #将数字索引和词元一一对应\n",
139 |     "        for token in uniq_tokens:\n",
140 |     "            self.idx_to_token.append(token)\n",
141 |     "            self.token_to_idx[token] = len(self.idx_to_token) - 1\n",
142 |     "    \n",
143 |     "    def __len__(self):\n",
144 |     "        \"\"\"返回储存词元字典的长度\"\"\"\n",
145 |     "        return len(self.idx_to_token) \n",
146 |     "    \n",
147 |     "    def __getitem__(self, tokens):\n",
148 |     "        \"\"\"输入一个词元,返回一个数字索引\"\"\"\n",
149 |     "        if not isinstance(tokens, (list, tuple)):\n",
150 |     "            return self.token_to_idx.get(tokens, self.unk)\n",
151 |     "        return [self.__getitem__(token) for token in tokens]\n",
152 |     "    \n",
153 |     "    def to_token(self, indices):\n",
154 |     "        \"\"\"输入一个数字索引,返回一个词元\"\"\"\n",
155 |     "        if not isinstance(indices, (list, tuple)):\n",
156 |     "            return self.idx_to_token[indices]\n",
157 |     "        return [self.to_token(idx) for idx in indices]\n",
158 |     "    \n",
159 |     "def corpus_counter(tokens):\n",
160 |     "    \"\"\"统计词频\"\"\"\n",
161 |     "    if (len(tokens)==0 or isinstance(tokens[0], list)): \n",
162 |     "        #将多行词元展平成一个词元列表以统计词元的出现频率\n",
163 |     "        tokens = [token for line in tokens for token in line]\n",
164 |     "    return collections.Counter(tokens)"
165 |    ]
166 |   },
167 |   {
168 |    "cell_type": "code",
169 |    "execution_count": 14,
170 | 
"id": "d71fa711", 171 | "metadata": {}, 172 | "outputs": [ 173 | { 174 | "name": "stdout", 175 | "output_type": "stream", 176 | "text": [ 177 | "[('', 0), ('the', 1), ('i', 2), ('and', 3), ('of', 4), ('a', 5), ('to', 6), ('was', 7), ('in', 8), ('that', 9)]\n" 178 | ] 179 | } 180 | ], 181 | "source": [ 182 | "vocab = Vocab(tokens)\n", 183 | "print(list(vocab.token_to_idx.items())[:10])" 184 | ] 185 | }, 186 | { 187 | "cell_type": "code", 188 | "execution_count": 16, 189 | "id": "0fa274dc", 190 | "metadata": {}, 191 | "outputs": [ 192 | { 193 | "data": { 194 | "text/plain": [ 195 | "<__main__.Vocab at 0x17b93959b50>" 196 | ] 197 | }, 198 | "execution_count": 16, 199 | "metadata": {}, 200 | "output_type": "execute_result" 201 | } 202 | ], 203 | "source": [ 204 | "vocab" 205 | ] 206 | }, 207 | { 208 | "cell_type": "markdown", 209 | "id": "31356f99", 210 | "metadata": {}, 211 | "source": [ 212 | "将文本行转为数字索引列表" 213 | ] 214 | }, 215 | { 216 | "cell_type": "code", 217 | "execution_count": 15, 218 | "id": "ad02d78c", 219 | "metadata": {}, 220 | "outputs": [ 221 | { 222 | "name": "stdout", 223 | "output_type": "stream", 224 | "text": [ 225 | "word: ['the', 'time', 'machine', 'by', 'h', 'g', 'wells']\n", 226 | "index: [1, 19, 50, 40, 2183, 2184, 400]\n", 227 | "word: ['twinkled', 'and', 'his', 'usually', 'pale', 'face', 'was', 'flushed', 'and', 'animated', 'the']\n", 228 | "index: [2186, 3, 25, 1044, 362, 113, 7, 1421, 3, 1045, 1]\n" 229 | ] 230 | } 231 | ], 232 | "source": [ 233 | "for i in [0, 10]:\n", 234 | " print('word:',tokens[i])\n", 235 | " print('index:',vocab[tokens[i]])" 236 | ] 237 | }, 238 | { 239 | "cell_type": "code", 240 | "execution_count": 20, 241 | "id": "5d8f1f8b", 242 | "metadata": {}, 243 | "outputs": [ 244 | { 245 | "data": { 246 | "text/plain": [ 247 | "(28, 170580)" 248 | ] 249 | }, 250 | "execution_count": 20, 251 | "metadata": {}, 252 | "output_type": "execute_result" 253 | } 254 | ], 255 | "source": [ 256 | "def load_corpus_time_machine(max_token=-1):\n", 257 | " lines = read_book()\n", 258 | " tokens = tokenize(lines, 'char')\n", 259 | " vocabu = Vocab(tokens)\n", 260 | " corpus = [vocabu[token] for line in tokens for token in line]\n", 261 | " \n", 262 | " if (max_token > 0):\n", 263 | " corpus = corpus[:max_token]\n", 264 | " return vocabu, corpus\n", 265 | "\n", 266 | "vocabu, corpus = load_corpus_time_machine()\n", 267 | "len(vocabu), len(corpus)" 268 | ] 269 | }, 270 | { 271 | "cell_type": "code", 272 | "execution_count": null, 273 | "id": "c896d67f", 274 | "metadata": {}, 275 | "outputs": [], 276 | "source": [] 277 | } 278 | ], 279 | "metadata": { 280 | "kernelspec": { 281 | "display_name": "Python [conda env:torch] *", 282 | "language": "python", 283 | "name": "conda-env-torch-py" 284 | }, 285 | "language_info": { 286 | "codemirror_mode": { 287 | "name": "ipython", 288 | "version": 3 289 | }, 290 | "file_extension": ".py", 291 | "mimetype": "text/x-python", 292 | "name": "python", 293 | "nbconvert_exporter": "python", 294 | "pygments_lexer": "ipython3", 295 | "version": "3.8.16" 296 | } 297 | }, 298 | "nbformat": 4, 299 | "nbformat_minor": 5 300 | } 301 | -------------------------------------------------------------------------------- /李沐DeepLearning/机器翻译数据集.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "id": "527b12d7", 7 | "metadata": {}, 8 | "outputs": [], 9 | "source": [ 10 | "import torch\n", 11 | "import os\n", 12 | "from d2l import 
torch as d2l" 13 | ] 14 | }, 15 | { 16 | "cell_type": "markdown", 17 | "id": "cb6703e5", 18 | "metadata": {}, 19 | "source": [ 20 | "下载数据集,数据集中的每一行都是制表符分隔的文本序列对, 序列对由英文文本序列和翻译后的法语文本序列组成。 请注意,每个文本序列可以是一个句子, 也可以是包含多个句子的一个段落。 在这个将英语翻译成法语的机器翻译问题中, 英语是源语言(source language), 法语是目标语言(target language)。" 21 | ] 22 | }, 23 | { 24 | "cell_type": "code", 25 | "execution_count": 2, 26 | "id": "23f592b7", 27 | "metadata": {}, 28 | "outputs": [ 29 | { 30 | "name": "stdout", 31 | "output_type": "stream", 32 | "text": [ 33 | "Go.\tVa !\n", 34 | "Hi.\tSalut !\n", 35 | "Run!\tCours !\n", 36 | "Run!\tCourez !\n", 37 | "Who?\tQui ?\n", 38 | "Wow!\tÇa alors !\n", 39 | "\n" 40 | ] 41 | } 42 | ], 43 | "source": [ 44 | "d2l.DATA_HUB['fra-eng'] = (d2l.DATA_URL + 'fra-eng.zip', '94646ad1522d915e7b0f9296181140edcf86a4f5')\n", 45 | "\n", 46 | "def read_data():\n", 47 | " data_dir = d2l.download_extract('fra-eng')\n", 48 | " with open(os.path.join(data_dir, 'fra.txt'), 'r', encoding='utf-8') as f:\n", 49 | " return f.read()\n", 50 | "\n", 51 | "raw_txt = read_data()\n", 52 | "print(raw_txt[:75])" 53 | ] 54 | }, 55 | { 56 | "cell_type": "markdown", 57 | "id": "38ecbb95", 58 | "metadata": {}, 59 | "source": [ 60 | "对数据集进行预处理" 61 | ] 62 | }, 63 | { 64 | "cell_type": "code", 65 | "execution_count": 3, 66 | "id": "87ae94ce", 67 | "metadata": {}, 68 | "outputs": [ 69 | { 70 | "name": "stdout", 71 | "output_type": "stream", 72 | "text": [ 73 | "go .\tva !\n", 74 | "hi .\tsalut !\n", 75 | "run !\tcours !\n", 76 | "run !\tcourez !\n", 77 | "who ?\tqui ?\n", 78 | "wow !\tça alors !\n" 79 | ] 80 | } 81 | ], 82 | "source": [ 83 | "def preprocess(text):\n", 84 | " # 将标点符号分离出来\n", 85 | " def no_space(char, prev_char):\n", 86 | " return char in set (',.!?') and prev_char != ' '\n", 87 | " \n", 88 | " # 将不连续空格替换成空格,将大写换成小写\n", 89 | " text = text.replace('\\u202f', ' ').replace('\\xa0', ' ').lower()\n", 90 | " \n", 91 | " # 在标点符号前面添加空格\n", 92 | " out = [' '+ char if i>0 and no_space(char, text[i-1]) else char for i, char in enumerate(text)]\n", 93 | " return ''.join(out)\n", 94 | "\n", 95 | "text = preprocess(raw_txt)\n", 96 | "print(text[:80])" 97 | ] 98 | }, 99 | { 100 | "cell_type": "markdown", 101 | "id": "fec7663a", 102 | "metadata": {}, 103 | "source": [ 104 | "词元化" 105 | ] 106 | }, 107 | { 108 | "cell_type": "code", 109 | "execution_count": 4, 110 | "id": "ce82cedc", 111 | "metadata": {}, 112 | "outputs": [ 113 | { 114 | "data": { 115 | "text/plain": [ 116 | "([['go', '.'],\n", 117 | " ['hi', '.'],\n", 118 | " ['run', '!'],\n", 119 | " ['run', '!'],\n", 120 | " ['who', '?'],\n", 121 | " ['wow', '!']],\n", 122 | " [['va', '!'],\n", 123 | " ['salut', '!'],\n", 124 | " ['cours', '!'],\n", 125 | " ['courez', '!'],\n", 126 | " ['qui', '?'],\n", 127 | " ['ça', 'alors', '!']])" 128 | ] 129 | }, 130 | "execution_count": 4, 131 | "metadata": {}, 132 | "output_type": "execute_result" 133 | } 134 | ], 135 | "source": [ 136 | "def tokenize(text, num_examples=None):\n", 137 | " # 构造源词元表和目标词元表\n", 138 | " source, target = [], []\n", 139 | " # 先按照行进行分离\n", 140 | " for i, line in enumerate(text.split('\\n')):\n", 141 | " # 判断是否超出范围\n", 142 | " if num_examples and i>num_examples:\n", 143 | " break\n", 144 | " # 按分隔间距进行分词\n", 145 | " parts = line.split('\\t')\n", 146 | " # 若长度为2则说明里面仅包含一个英语词和一个法语词,按对压入源词元表和目标词元表\n", 147 | " if len(parts) == 2:\n", 148 | " source.append(parts[0].split(' '))\n", 149 | " target.append(parts[1].split(' '))\n", 150 | " return source, target\n", 151 | "\n", 152 | "src, tar = tokenize(text)\n", 153 | "src[:6], tar[:6]" 
154 |      ]
155 |     },
156 |     {
157 |      "cell_type": "markdown",
158 |      "id": "326564a7",
159 |      "metadata": {},
160 |      "source": [
161 |       "画出每个文本序列含有多少词元"
162 |      ]
163 |     },
164 |     {
165 |      "cell_type": "code",
166 |      "execution_count": 7,
167 |      "id": "dd50052b",
168 |      "metadata": {},
169 |      "outputs": [
170 |       {
171 |        "data": {
172 |         "image/svg+xml": [
[lines 173-1148: inline SVG markup of the Matplotlib output (the tokens-per-sequence histogram) omitted]
1149 |         ],
1150 |         "text/plain": [
1151 |          "<Figure size 350x250 with 1 Axes>"
1152 |         ]
1153 |        },
1154 |        "metadata": {},
1155 |        "output_type": "display_data"
1156 |       }
1157 |      ],
1158 |      "source": [
1159 |       "def show_tokens_per_seq(legend, xlabel, ylabel, xlist, ylist):\n",
1160 |       "    d2l.set_figsize()\n",
1161 |       "    _, _, patches = d2l.plt.hist([[len(l) for l in xlist], [len(l) for l in ylist]])\n",
1162 |       "    d2l.plt.xlabel(xlabel)\n",
1163 |       "    d2l.plt.ylabel(ylabel)\n",
1164 |       "    for patch in patches[1].patches:\n",
1165 |       "        patch.set_hatch('/')\n",
1166 |       "    d2l.plt.legend(legend)\n",
1167 |       "\n",
1168 |       "show_tokens_per_seq(['source', 'target'], 'tokens per sequence', 'count', src, tar)"
1169 |      ]
1170 |     },
1171 |     {
1172 |      "cell_type": "markdown",
1173 |      "id": "45b138e9",
1174 |      "metadata": {},
1175 |      "source": [
1176 |       "构造一个源语言的词汇表"
1177 |      ]
1178 |     },
1179 |     {
1180 |      "cell_type": "code",
1181 |      "execution_count": 41,
1182 |      "id": "b0d35122",
1183 |      "metadata": {},
1184 |      "outputs": [
1185 |       {
1186 |        "data": {
1187 |         "text/plain": [
1188 |          "10012"
1189 |         ]
1190 |        },
1191 |        "execution_count": 41,
1192 |        "metadata": {},
1193 |        "output_type": "execute_result"
1194 |       }
1195 |      ],
1196 |      "source": [
1197 |       "src_vocab = d2l.Vocab(src, min_freq=2, reserved_tokens=['<pad>', '<bos>', '<eos>'])\n",
1198 |       "len(src_vocab)"
1199 |      ]
1200 |     },
1201 |     {
1202 |      "cell_type": "markdown",
1203 |      "id": "77911d0c",
1204 |      "metadata": {},
1205 |      "source": [
1206 |       "构造一个函数能在序列中制造小批量数据:给定一个时间步长num_steps,一个小批量数据内的每个序列的长度都应该为num_steps。若一个序列的长度大于num_steps,就应该截取前num_steps个词元组成序列;若一个序列的长度小于num_steps,则应该在其尾部补充填充词元直至满足num_steps的长度。"
1207 |      ]
1208 |     },
1209 |     {
1210 |      "cell_type": "code",
1211 |      "execution_count": 37,
1212 |      "id": "ccac8888",
1213 |      "metadata": {},
1214 |      "outputs": [
1215 |       {
1216 |        "data": {
1217 |         "text/plain": [
1218 |          "[47, 4, 1, 1, 1, 1, 1, 1, 1, 1]"
1219 |         ]
1220 |        },
1221 |        "execution_count": 37,
1222 |        "metadata": {},
1223 |        "output_type": "execute_result"
1224 |       }
1225 |      ],
1226 |      "source": [
1227 |       "def truncate_pad(line, num_steps, padding_token):\n",
1228 |       "    if len(line) > num_steps:\n",
1229 |       "        line = line[:num_steps]\n",
1230 |       "    else:\n",
1231 |       "        line = line + [padding_token] * (num_steps - len(line))\n",
1232 |       "    return line\n",
1233 |       "\n",
1234 |       "truncate_pad(src_vocab[src[0]], 10, src_vocab['<pad>'])"
1235 |      ]
1236 |     },
1237 |     {
1238 |      "cell_type": "code",
1239 |      "execution_count": 43,
1240 |      "id": "d8850cc2",
1241 |      "metadata": {},
1242 |      "outputs": [],
1243 |      "source": [
1244 |       "def array_build(lines, vocab, num_steps):\n",
1245 |       "    lines = [vocab[l] for l in lines]\n",
1246 |       "    lines = [l + [vocab['<eos>']] for l in lines]\n",
1247 |       "    array = torch.tensor([truncate_pad(l, num_steps, vocab['<pad>']) for l in lines])\n",
1248 |       "    valid_len = (array != vocab['<pad>']).type(torch.int32).sum(1)\n",
1249 |       "    return array, valid_len"
1250 |      ]
1251 |     },
1252 |     {
1253 |      "cell_type": "code",
1254 |      "execution_count": 44,
1255 |      "id": "2c25c5c3",
1256 |      "metadata": {},
1257 |      "outputs": [],
1258 |      "source": [
1259 |       "def load_data(batch_size, num_steps, num_examples=600):\n",
1260 |       "    # 文本序列预处理\n",
1261 |       "    text = preprocess(read_data())\n",
1262 |       "    # 词元化\n",
1263 |       "    source, target = tokenize(text, num_examples)\n",
1264 |       "    src_vocab = d2l.Vocab(source, min_freq=2, reserved_tokens=['<pad>', '<bos>', '<eos>'])\n",
1265 |       "    tar_vocab = d2l.Vocab(target, min_freq=2, reserved_tokens=['<pad>', '<bos>', '<eos>'])\n",
1266 |       "    src_array, src_valid_len = array_build(source, src_vocab, num_steps)\n",
1267 |       "    tar_array, tar_valid_len = array_build(target, tar_vocab, num_steps)\n",
1268 |       "    data_array = (src_array, src_valid_len,
tar_array, tar_valid_len)\n", 1269 | " data_iter = d2l.load_array(data_array, batch_size)\n", 1270 | " return src_vocab, tar_vocab, data_iter" 1271 | ] 1272 | }, 1273 | { 1274 | "cell_type": "code", 1275 | "execution_count": 45, 1276 | "id": "c3dcd1a0", 1277 | "metadata": { 1278 | "scrolled": true 1279 | }, 1280 | "outputs": [ 1281 | { 1282 | "name": "stdout", 1283 | "output_type": "stream", 1284 | "text": [ 1285 | "X: tensor([[10, 73, 4, 3, 1, 1, 1, 1],\n", 1286 | " [14, 27, 4, 3, 1, 1, 1, 1]], dtype=torch.int32)\n", 1287 | "X的有效长度: tensor([4, 4])\n", 1288 | "Y: tensor([[ 8, 0, 4, 3, 1, 1, 1, 1],\n", 1289 | " [26, 58, 5, 3, 1, 1, 1, 1]], dtype=torch.int32)\n", 1290 | "Y的有效长度: tensor([4, 4])\n" 1291 | ] 1292 | } 1293 | ], 1294 | "source": [ 1295 | "src_vocab, tar_vocab, train_iter = load_data(batch_size=2, num_steps=8)\n", 1296 | "for X, X_valid_len, Y, Y_valid_len in train_iter:\n", 1297 | " print('X:', X.type(torch.int32))\n", 1298 | " print('X的有效长度:', X_valid_len)\n", 1299 | " print('Y:', Y.type(torch.int32))\n", 1300 | " print('Y的有效长度:', Y_valid_len)\n", 1301 | " break" 1302 | ] 1303 | }, 1304 | { 1305 | "cell_type": "code", 1306 | "execution_count": null, 1307 | "id": "80d01070", 1308 | "metadata": {}, 1309 | "outputs": [], 1310 | "source": [] 1311 | } 1312 | ], 1313 | "metadata": { 1314 | "kernelspec": { 1315 | "display_name": "Python [conda env:torch] *", 1316 | "language": "python", 1317 | "name": "conda-env-torch-py" 1318 | }, 1319 | "language_info": { 1320 | "codemirror_mode": { 1321 | "name": "ipython", 1322 | "version": 3 1323 | }, 1324 | "file_extension": ".py", 1325 | "mimetype": "text/x-python", 1326 | "name": "python", 1327 | "nbconvert_exporter": "python", 1328 | "pygments_lexer": "ipython3", 1329 | "version": "3.8.16" 1330 | } 1331 | }, 1332 | "nbformat": 4, 1333 | "nbformat_minor": 5 1334 | } 1335 | -------------------------------------------------------------------------------- /李沐DeepLearning/线性回归.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 2, 6 | "id": "232a5455", 7 | "metadata": {}, 8 | "outputs": [], 9 | "source": [ 10 | "%matplotlib inline\n", 11 | "import torch\n", 12 | "import random" 13 | ] 14 | }, 15 | { 16 | "cell_type": "markdown", 17 | "id": "81fe2d86", 18 | "metadata": {}, 19 | "source": [ 20 | "首先构造参数为w和b以及带有一个噪声项$\\epsilon$的人造数据集**y** = **X** **w** + b + $\\epsilon$" 21 | ] 22 | }, 23 | { 24 | "cell_type": "code", 25 | "execution_count": 7, 26 | "id": "6fe5f7df", 27 | "metadata": {}, 28 | "outputs": [], 29 | "source": [ 30 | "def synthentic_data(w, b, num_samples):\n", 31 | " X = torch.normal(0, 1, (num_samples, len(w))) ##X为均值为0方差为1的随机数,数量和参数w的数量一致\n", 32 | " y = torch.matmul(X, w) + b \n", 33 | " y += torch.normal(0, 0.01, y.shape)\n", 34 | " return X, y.reshape(-1, 1) ##将x和y的形状都变为列向量\n", 35 | " \n", 36 | "w_true = torch.tensor([2, -3.4])\n", 37 | "b_true = 4.2\n", 38 | "\n", 39 | "features, labels = synthentic_data(w_true, b_true, 1000)" 40 | ] 41 | }, 42 | { 43 | "cell_type": "code", 44 | "execution_count": 9, 45 | "id": "1e7af7c4", 46 | "metadata": {}, 47 | "outputs": [ 48 | { 49 | "name": "stdout", 50 | "output_type": "stream", 51 | "text": [ 52 | "features: tensor([-0.8105, -0.3568]) \n", 53 | "label: tensor([3.8019])\n" 54 | ] 55 | } 56 | ], 57 | "source": [ 58 | "print('features:', features[0], '\\nlabel:', labels[0])" 59 | ] 60 | }, 61 | { 62 | "cell_type": "markdown", 63 | "id": "5cb47cac", 64 | "metadata": {}, 65 | "source": [ 
66 | "构造一个函数能够每次随机小批量地从数据集中采样" 67 | ] 68 | }, 69 | { 70 | "cell_type": "code", 71 | "execution_count": 12, 72 | "id": "56908aa6", 73 | "metadata": {}, 74 | "outputs": [ 75 | { 76 | "name": "stdout", 77 | "output_type": "stream", 78 | "text": [ 79 | "tensor([[ 0.4418, -0.2143],\n", 80 | " [-1.8064, 0.6571],\n", 81 | " [ 0.7412, 0.1856],\n", 82 | " [ 0.6559, 0.8517],\n", 83 | " [-0.6508, -0.8636],\n", 84 | " [-0.2165, 0.0563],\n", 85 | " [-0.1913, -0.6041],\n", 86 | " [-2.3457, 1.4957],\n", 87 | " [-1.1084, -1.6875],\n", 88 | " [ 0.9823, 0.6570]]) \n", 89 | " tensor([[ 5.8165],\n", 90 | " [-1.6394],\n", 91 | " [ 5.0659],\n", 92 | " [ 2.6219],\n", 93 | " [ 5.8414],\n", 94 | " [ 3.5669],\n", 95 | " [ 5.8631],\n", 96 | " [-5.5850],\n", 97 | " [ 7.7229],\n", 98 | " [ 3.9221]])\n" 99 | ] 100 | } 101 | ], 102 | "source": [ 103 | "def data_batch(batch_size, feature, label):\n", 104 | " num_example = len(feature)\n", 105 | " induice = list(range(num_example))\n", 106 | " random.shuffle(induice)\n", 107 | " \n", 108 | " for i in range(0, num_example, batch_size):\n", 109 | " batch_induice = torch.tensor(induice[i:min(i + batch_size,num_example)])\n", 110 | " yield feature[batch_induice], label[batch_induice]\n", 111 | " \n", 112 | "batch_size = 10\n", 113 | "for X, y in data_batch(batch_size, features, labels):\n", 114 | " print (X, '\\n', y)\n", 115 | " break" 116 | ] 117 | }, 118 | { 119 | "cell_type": "markdown", 120 | "id": "7d4bdc1c", 121 | "metadata": {}, 122 | "source": [ 123 | "初始化模型参数w和b" 124 | ] 125 | }, 126 | { 127 | "cell_type": "code", 128 | "execution_count": 16, 129 | "id": "f191d148", 130 | "metadata": {}, 131 | "outputs": [], 132 | "source": [ 133 | "w = torch.normal(0, 0.01, size=(2,1), requires_grad=True)\n", 134 | "b = torch.zeros(1, requires_grad=True)" 135 | ] 136 | }, 137 | { 138 | "cell_type": "markdown", 139 | "id": "bdeb2478", 140 | "metadata": {}, 141 | "source": [ 142 | "构造线性模型" 143 | ] 144 | }, 145 | { 146 | "cell_type": "code", 147 | "execution_count": 17, 148 | "id": "242d2b86", 149 | "metadata": {}, 150 | "outputs": [], 151 | "source": [ 152 | "def linear_reg(X, w, b):\n", 153 | " return torch.matmul(X, w) + b" 154 | ] 155 | }, 156 | { 157 | "cell_type": "markdown", 158 | "id": "ec33debf", 159 | "metadata": {}, 160 | "source": [ 161 | "定义损失函数" 162 | ] 163 | }, 164 | { 165 | "cell_type": "code", 166 | "execution_count": 21, 167 | "id": "afd4dbad", 168 | "metadata": {}, 169 | "outputs": [], 170 | "source": [ 171 | "##损失函数为MSE\n", 172 | "def loss_func(y, y_hat):\n", 173 | " return (y_hat - y.reshape(y_hat.shape)) ** 2 / 2" 174 | ] 175 | }, 176 | { 177 | "cell_type": "markdown", 178 | "id": "d07ee339", 179 | "metadata": {}, 180 | "source": [ 181 | "定义优化器" 182 | ] 183 | }, 184 | { 185 | "cell_type": "code", 186 | "execution_count": 19, 187 | "id": "4fb8dbb2", 188 | "metadata": {}, 189 | "outputs": [], 190 | "source": [ 191 | "##优化器为随机梯度下降SGD\n", 192 | "def SGD(params, learning_rate, batch_size):\n", 193 | " with torch.no_grad():\n", 194 | " for param in params:\n", 195 | " param -= learning_rate * param.grad / batch_size\n", 196 | " param.grad.zero_()" 197 | ] 198 | }, 199 | { 200 | "cell_type": "code", 201 | "execution_count": 22, 202 | "id": "84c5931e", 203 | "metadata": {}, 204 | "outputs": [ 205 | { 206 | "name": "stdout", 207 | "output_type": "stream", 208 | "text": [ 209 | "epoch 1, loss 0.055980\n", 210 | "epoch 2, loss 0.000252\n", 211 | "epoch 3, loss 0.000050\n", 212 | "epoch 4, loss 0.000049\n", 213 | "epoch 5, loss 0.000049\n", 214 | "epoch 6, loss 0.000049\n", 
215 | "epoch 7, loss 0.000049\n", 216 | "epoch 8, loss 0.000049\n", 217 | "epoch 9, loss 0.000049\n", 218 | "epoch 10, loss 0.000049\n" 219 | ] 220 | } 221 | ], 222 | "source": [ 223 | "learning_rate = 0.03\n", 224 | "net = linear_reg\n", 225 | "loss = loss_func\n", 226 | "epochs = 10\n", 227 | "\n", 228 | "for epoch in range(epochs):\n", 229 | " for X,y in data_batch(batch_size, features, labels):\n", 230 | " loss = loss_func(net(X, w, b), y)\n", 231 | " loss.sum().backward()\n", 232 | " SGD([w,b], learning_rate, batch_size)\n", 233 | " \n", 234 | " with torch.no_grad():\n", 235 | " train_loss = loss_func(net(features,w,b), labels)\n", 236 | " print (f'epoch { epoch + 1 }, loss {float(train_loss.mean()):f}')" 237 | ] 238 | }, 239 | { 240 | "cell_type": "markdown", 241 | "id": "b74f8bfc", 242 | "metadata": {}, 243 | "source": [ 244 | "使用pytorch设定好的内置函数实现" 245 | ] 246 | }, 247 | { 248 | "cell_type": "code", 249 | "execution_count": 27, 250 | "id": "e3152bb2", 251 | "metadata": {}, 252 | "outputs": [], 253 | "source": [ 254 | "from torch.utils import data\n", 255 | "import numpy as np" 256 | ] 257 | }, 258 | { 259 | "cell_type": "markdown", 260 | "id": "1d886f0b", 261 | "metadata": {}, 262 | "source": [ 263 | "构造自定义数据集" 264 | ] 265 | }, 266 | { 267 | "cell_type": "code", 268 | "execution_count": 32, 269 | "id": "02227ccb", 270 | "metadata": {}, 271 | "outputs": [ 272 | { 273 | "data": { 274 | "text/plain": [ 275 | "[tensor([[-1.6186, 1.1914],\n", 276 | " [-0.7390, 0.7205],\n", 277 | " [ 0.9826, 0.6103],\n", 278 | " [ 0.8132, -0.0249],\n", 279 | " [-0.4938, 0.8550],\n", 280 | " [-0.0217, 0.4927],\n", 281 | " [ 0.8233, 0.3651],\n", 282 | " [ 0.3465, -0.4650],\n", 283 | " [ 0.0432, -0.1148],\n", 284 | " [ 0.4177, 0.7377]]),\n", 285 | " tensor([[-3.0947],\n", 286 | " [ 0.2740],\n", 287 | " [ 4.0838],\n", 288 | " [ 5.9232],\n", 289 | " [ 0.3082],\n", 290 | " [ 2.4810],\n", 291 | " [ 4.6042],\n", 292 | " [ 6.4671],\n", 293 | " [ 4.6845],\n", 294 | " [ 2.5223]])]" 295 | ] 296 | }, 297 | "execution_count": 32, 298 | "metadata": {}, 299 | "output_type": "execute_result" 300 | } 301 | ], 302 | "source": [ 303 | "def Dataset(Data, batch_size, is_train=True):\n", 304 | " dataset = data.TensorDataset(*Data)\n", 305 | " dataloader = data.DataLoader(dataset, batch_size, shuffle=is_train)\n", 306 | " return dataloader\n", 307 | "\n", 308 | "batch_size = 10\n", 309 | "DataSet = Dataset((features, labels), batch_size)\n", 310 | "\n", 311 | "next(iter(DataSet))" 312 | ] 313 | }, 314 | { 315 | "cell_type": "markdown", 316 | "id": "d8c7f994", 317 | "metadata": {}, 318 | "source": [ 319 | "定义网络" 320 | ] 321 | }, 322 | { 323 | "cell_type": "code", 324 | "execution_count": 37, 325 | "id": "414e3749", 326 | "metadata": {}, 327 | "outputs": [], 328 | "source": [ 329 | "import torch.nn as nn\n", 330 | "\n", 331 | "reg = nn.Sequential(nn.Linear(2,1))" 332 | ] 333 | }, 334 | { 335 | "cell_type": "markdown", 336 | "id": "047d73cc", 337 | "metadata": {}, 338 | "source": [ 339 | "初始化权重" 340 | ] 341 | }, 342 | { 343 | "cell_type": "code", 344 | "execution_count": 35, 345 | "id": "b33bc9d8", 346 | "metadata": {}, 347 | "outputs": [ 348 | { 349 | "data": { 350 | "text/plain": [ 351 | "tensor([0.])" 352 | ] 353 | }, 354 | "execution_count": 35, 355 | "metadata": {}, 356 | "output_type": "execute_result" 357 | } 358 | ], 359 | "source": [ 360 | "net[0].weight.data.normal_(0, 0.01)\n", 361 | "net[0].bias.data.fill_(0)" 362 | ] 363 | }, 364 | { 365 | "cell_type": "markdown", 366 | "id": "ddf346e7", 367 | "metadata": {}, 368 | 
"source": [ 369 | "定义损失函数" 370 | ] 371 | }, 372 | { 373 | "cell_type": "code", 374 | "execution_count": 36, 375 | "id": "571b96fd", 376 | "metadata": {}, 377 | "outputs": [], 378 | "source": [ 379 | "Loss = nn.MSELoss()" 380 | ] 381 | }, 382 | { 383 | "cell_type": "markdown", 384 | "id": "211aa4e7", 385 | "metadata": {}, 386 | "source": [ 387 | "构造优化器" 388 | ] 389 | }, 390 | { 391 | "cell_type": "code", 392 | "execution_count": 38, 393 | "id": "79c30d42", 394 | "metadata": {}, 395 | "outputs": [], 396 | "source": [ 397 | "optimizer = torch.optim.SGD(net.parameters(), lr=0.03)" 398 | ] 399 | }, 400 | { 401 | "cell_type": "code", 402 | "execution_count": 40, 403 | "id": "9485620b", 404 | "metadata": {}, 405 | "outputs": [ 406 | { 407 | "name": "stdout", 408 | "output_type": "stream", 409 | "text": [ 410 | " epoch 1, Loss 0.000420\n", 411 | " epoch 2, Loss 0.000099\n", 412 | " epoch 3, Loss 0.000099\n" 413 | ] 414 | } 415 | ], 416 | "source": [ 417 | "num_epochs = 3\n", 418 | "\n", 419 | "for epoch in range(num_epochs):\n", 420 | " for X, y in DataSet:\n", 421 | " train_loss = Loss(net(X), y)\n", 422 | " optimizer.zero_grad()\n", 423 | " train_loss.backward()\n", 424 | " optimizer.step()\n", 425 | " train_loss = Loss(net(features), labels)\n", 426 | " print(f' epoch {epoch + 1}, Loss {train_loss:f}')" 427 | ] 428 | }, 429 | { 430 | "cell_type": "code", 431 | "execution_count": null, 432 | "id": "5da10112", 433 | "metadata": {}, 434 | "outputs": [], 435 | "source": [] 436 | } 437 | ], 438 | "metadata": { 439 | "kernelspec": { 440 | "display_name": "Python [conda env:torch] *", 441 | "language": "python", 442 | "name": "conda-env-torch-py" 443 | }, 444 | "language_info": { 445 | "codemirror_mode": { 446 | "name": "ipython", 447 | "version": 3 448 | }, 449 | "file_extension": ".py", 450 | "mimetype": "text/x-python", 451 | "name": "python", 452 | "nbconvert_exporter": "python", 453 | "pygments_lexer": "ipython3", 454 | "version": "3.8.16" 455 | } 456 | }, 457 | "nbformat": 4, 458 | "nbformat_minor": 5 459 | } 460 | -------------------------------------------------------------------------------- /李沐DeepLearning/自动求导.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": 1, 6 | "id": "d01576cb", 7 | "metadata": {}, 8 | "outputs": [], 9 | "source": [ 10 | "import torch" 11 | ] 12 | }, 13 | { 14 | "cell_type": "markdown", 15 | "id": "eed543e0", 16 | "metadata": {}, 17 | "source": [ 18 | "假设现在要对y=2**x** T **x** 中的**x**向量求导" 19 | ] 20 | }, 21 | { 22 | "cell_type": "markdown", 23 | "id": "cfbbc43d", 24 | "metadata": {}, 25 | "source": [ 26 | "先声明一个**x**向量" 27 | ] 28 | }, 29 | { 30 | "cell_type": "code", 31 | "execution_count": 3, 32 | "id": "f2e8857d", 33 | "metadata": {}, 34 | "outputs": [ 35 | { 36 | "data": { 37 | "text/plain": [ 38 | "tensor([0., 1., 2., 3.])" 39 | ] 40 | }, 41 | "execution_count": 3, 42 | "metadata": {}, 43 | "output_type": "execute_result" 44 | } 45 | ], 46 | "source": [ 47 | "x = torch.arange(4.0)\n", 48 | "x" 49 | ] 50 | }, 51 | { 52 | "cell_type": "markdown", 53 | "id": "c9c6a311", 54 | "metadata": {}, 55 | "source": [ 56 | "在计算梯度(求导)之前,需要一个地方来储存梯度" 57 | ] 58 | }, 59 | { 60 | "cell_type": "code", 61 | "execution_count": 6, 62 | "id": "ca5b3ad0", 63 | "metadata": {}, 64 | "outputs": [ 65 | { 66 | "data": { 67 | "text/plain": [ 68 | "tensor([0., 1., 2., 3.], requires_grad=True)" 69 | ] 70 | }, 71 | "execution_count": 6, 72 | "metadata": {}, 73 | "output_type": 
"execute_result" 74 | } 75 | ], 76 | "source": [ 77 | "x.requires_grad_(True)\n", 78 | "x" 79 | ] 80 | }, 81 | { 82 | "cell_type": "markdown", 83 | "id": "3dd30e14", 84 | "metadata": {}, 85 | "source": [ 86 | "计算y" 87 | ] 88 | }, 89 | { 90 | "cell_type": "code", 91 | "execution_count": 7, 92 | "id": "ac661ae9", 93 | "metadata": {}, 94 | "outputs": [ 95 | { 96 | "data": { 97 | "text/plain": [ 98 | "tensor(28., grad_fn=)" 99 | ] 100 | }, 101 | "execution_count": 7, 102 | "metadata": {}, 103 | "output_type": "execute_result" 104 | } 105 | ], 106 | "source": [ 107 | "y = 2 * torch.dot(x,x)\n", 108 | "y" 109 | ] 110 | }, 111 | { 112 | "cell_type": "markdown", 113 | "id": "c8b267e4", 114 | "metadata": {}, 115 | "source": [ 116 | "调用反向传播函数来自动计算y关于x每个分量的梯度" 117 | ] 118 | }, 119 | { 120 | "cell_type": "code", 121 | "execution_count": 8, 122 | "id": "be1c4f8b", 123 | "metadata": {}, 124 | "outputs": [ 125 | { 126 | "data": { 127 | "text/plain": [ 128 | "tensor([ 0., 4., 8., 12.])" 129 | ] 130 | }, 131 | "execution_count": 8, 132 | "metadata": {}, 133 | "output_type": "execute_result" 134 | } 135 | ], 136 | "source": [ 137 | "y.backward()\n", 138 | "x.grad" 139 | ] 140 | }, 141 | { 142 | "cell_type": "markdown", 143 | "id": "368da92c", 144 | "metadata": {}, 145 | "source": [ 146 | "pytorch自动求导时会累积梯度,因此在计算另一个x的函数时,需要对之前的梯度清零。" 147 | ] 148 | }, 149 | { 150 | "cell_type": "code", 151 | "execution_count": 9, 152 | "id": "97f73a30", 153 | "metadata": {}, 154 | "outputs": [ 155 | { 156 | "data": { 157 | "text/plain": [ 158 | "tensor([1., 1., 1., 1.])" 159 | ] 160 | }, 161 | "execution_count": 9, 162 | "metadata": {}, 163 | "output_type": "execute_result" 164 | } 165 | ], 166 | "source": [ 167 | "x.grad.zero_()\n", 168 | "y = x.sum()\n", 169 | "y.backward()\n", 170 | "x.grad" 171 | ] 172 | }, 173 | { 174 | "cell_type": "code", 175 | "execution_count": null, 176 | "id": "6d5eb437", 177 | "metadata": {}, 178 | "outputs": [], 179 | "source": [] 180 | } 181 | ], 182 | "metadata": { 183 | "kernelspec": { 184 | "display_name": "Python [conda env:torch] *", 185 | "language": "python", 186 | "name": "conda-env-torch-py" 187 | }, 188 | "language_info": { 189 | "codemirror_mode": { 190 | "name": "ipython", 191 | "version": 3 192 | }, 193 | "file_extension": ".py", 194 | "mimetype": "text/x-python", 195 | "name": "python", 196 | "nbconvert_exporter": "python", 197 | "pygments_lexer": "ipython3", 198 | "version": "3.8.16" 199 | } 200 | }, 201 | "nbformat": 4, 202 | "nbformat_minor": 5 203 | } 204 | -------------------------------------------------------------------------------- /李沐DeepLearning/语言模型/README.md: -------------------------------------------------------------------------------- 1 | ### 一. 语言模型的概念 2 | 在自然语言处理(NLP)中,语言模型是一种用来对语言进行建模的统计模型。其主要目的是计算给定一段文本序列的概率值或对下一个词或字符的预测值。 3 | 4 | 语言模型通常基于概率模型来构建,它考虑了语言的各种特征,例如语法、语义和上下文。具体来说,语言模型可以根据一定的训练数据学习到一个概率分布,该分布可以描述一个给定的文本序列中每个单词出现的概率,或者是下一个单词的预测概率。这些概率可以用来评估一个给定的文本序列是否合理,或者给出一个可能的下一个单词或短语。 5 | 6 | ### 二. 
Learning a Language Model
7 | Start from the basic rules of probability. Given a text sequence of length $T$ whose tokens are $x_1,x_2,...,x_T$ in order, $x_t$ $(1\le t \le T)$ can be regarded as the observation (or the label) of the text sequence at time step $t$. The goal of a language model is then to estimate the joint probability of the text sequence:
8 | 
9 | $P(x_1,x_2,...,x_T)=\prod_{t=1}^{T}P(x_t|x_{t-1},...,x_2,x_1)$ (1)
10 | 
11 | To train a language model, we need the probability of a single word, together with the conditional probability of a word given the several words that precede it. These probabilities are, in essence, the parameters of the language model.
12 | 
13 | For reference, the definition of conditional probability:
14 | 
15 | Conditional probability is the probability that one event occurs given that another event has already occurred. In general, the probability of an event B given that event A has occurred is written $P(B|A)$.
16 | 
17 | For example, the probability of a text sequence containing four words is:
18 | 
19 | $P(deep,learning,is,fun)=P(deep)P(learning|deep)P(is|deep,learning)P(fun|deep,learning,is)$
20 | 
21 | Next, the probability of all sequences in the text that start with "deep", written $\hat{P}(deep)$, can be estimated as:
22 | 
23 | $\hat{P}(deep)=\frac{n(deep)}{n(total)}$
24 | 
25 | Extending this to a pair of consecutive words, $\hat{P}(learning|deep)$ is estimated as:
26 | 
27 | $\hat{P}(learning|deep)=\frac{n(deep,learning)}{n(deep)}$
28 | 
29 | where $n(x)$ and $n(x,x')$ are the number of occurrences of a single word and of a pair of consecutive words, respectively. Because a pair such as "deep learning" occurs far less often in a text than either word alone, estimating the probability of such pairs accurately is quite hard, and for combinations of three or more words it is harder still.
30 | 
31 | One remedy is to add a small constant to every count for combinations of two or more words; this is known as Laplace smoothing. However, it can easily render the model ineffective: all counts have to be stored, which is computationally expensive; the method ignores the meaning of the words, relying entirely on counts, so it cannot adjust to context; and many multi-word combinations never appear in the training text, so a model that only tallies what it has already seen cannot correctly predict unseen long combinations.
32 | 
33 | The Markov model introduced below therefore addresses the problem of computing probabilities for long word sequences.
34 | 
35 | ### 三. Markov Models and n-grams
36 | A Markov model is a probabilistic model based on a Markov process. It describes a random process in which each state transition follows a probability distribution conditioned on the previous state: the current state depends only on the state immediately before it and is independent of everything earlier.
37 | 
38 | Markov models come in different orders: in a first-order Markov model the current state depends only on the previous state, while in a second-order model it depends on the previous two states.
39 | 
40 | Applied to language modeling: if $P(x_{t+1}|x_t,...,x_1)=P(x_{t+1}|x_t)$, i.e. the observation of the sequence at time step $t+1$ depends only on the observation at time step $t$, then the distribution over the sequence satisfies the first-order Markov property. The higher the order, the longer the dependencies that are captured. This yields a family of approximations commonly used for sequence modeling:
41 | 
42 | $P(x_1,x_2,x_3,x_4)=P(x_1)P(x_2)P(x_3)P(x_4)$
43 | 
44 | $P(x_1,x_2,x_3,x_4)=P(x_1)P(x_2|x_1)P(x_3|x_2)P(x_4|x_3)$ (2)
45 | 
46 | $P(x_1,x_2,x_3,x_4)=P(x_1)P(x_2|x_1)P(x_3|x_1,x_2)P(x_4|x_2,x_3)$
47 | 
48 | Probability formulas involving one, two, and three variables are usually called unigram, bigram, and trigram models, respectively.
49 | 
50 | The language model is implemented in code below.
51 | 
52 | ### 四. Natural Language Statistics
53 | ```python
54 | tokens = d2l.tokenize(d2l.read_time_machine())  # extract tokens
55 | corpus = [token for line in tokens for token in line]  # build the corpus
56 | vocab = d2l.Vocab(corpus)  # build the vocabulary
57 | 
58 | print(vocab.token_freqs[:10])
59 | ```
60 | The d2l library from Dive into Deep Learning is used here for convenient reading of the text data and construction of the vocabulary.
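61 | To see what vocab.token_freqs holds without relying on d2l, the same statistics can be reproduced with the standard library. This is a minimal sketch; it assumes the corpus list built above, and the names counter and token_freqs are local to the sketch:
62 | ```python
63 | import collections
64 | 
65 | counter = collections.Counter(corpus)  # corpus is the flat token list from above
66 | token_freqs = sorted(counter.items(), key=lambda x: x[1], reverse=True)
67 | print(token_freqs[:10])  # the head of the list is dominated by very common words
68 | ```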
69 | 
70 | With the vocabulary in hand, the word-frequency plot can be drawn:
71 | ```python
72 | freq = [freq for token, freq in vocab.token_freqs]
73 | d2l.plot(freq, xlabel='token: x', ylabel='freq: y', xscale='log', yscale='log')
74 | ```
75 | ![fig1](../图片/语言模型/fig1.png)
76 | 
77 | Figure 1: unigram word-frequency plot
78 | 
79 | Tokens with a very high frequency are often classified as stop words: words that are routinely ignored in text processing because they usually contribute little to the meaning of a text. Common stop words include pronouns, prepositions, conjunctions, and articles; in English, high-frequency words such as "the", "and", and "a" are also treated as stop words.
80 | 
81 | Next come the bigram and trigram frequencies, i.e. how often combinations of two consecutive tokens and of three consecutive tokens appear in the dataset.
82 | ```python
83 | # bigrams
84 | bi_tokens = [pair for pair in zip(corpus[:-1], corpus[1:])]
85 | bi_vocab = d2l.Vocab(bi_tokens)
86 | bi_vocab.token_freqs[:10]
87 | # trigrams
88 | tri_tokens = [triple for triple in zip(corpus[:-2], corpus[1:-1], corpus[2:])]
89 | tri_vocab = d2l.Vocab(tri_tokens)
90 | tri_vocab.token_freqs[:10]
91 | ```
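92 | Before plotting, it is worth quantifying how sparse these higher-order statistics are. The following sketch (assuming bi_vocab and tri_vocab from above; bi_once and tri_once are names local to the sketch) counts how many n-grams occur exactly once:
93 | ```python
94 | # n-grams that appear exactly once make up the long tail of the distribution
95 | bi_once = sum(1 for _, freq in bi_vocab.token_freqs if freq == 1)
96 | tri_once = sum(1 for _, freq in tri_vocab.token_freqs if freq == 1)
97 | print(bi_once, 'of', len(bi_vocab.token_freqs), 'bigrams occur once')
98 | print(tri_once, 'of', len(tri_vocab.token_freqs), 'trigrams occur once')
99 | ```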
100 | The frequency plot for all three n-gram models:
101 | ```python
102 | bi_freq = [freq for token, freq in bi_vocab.token_freqs]
103 | tri_freq = [freq for token, freq in tri_vocab.token_freqs]
104 | d2l.plot([freq, bi_freq, tri_freq], xlabel='token: x', ylabel='freq: y', xscale='log', yscale='log', legend=['uni_freq','bi_freq','tri_freq'])
105 | ```
106 | ![fig2](../图片/语言模型/fig2.png)
107 | 
108 | Figure 2: unigram, bigram, and trigram frequency plot
109 | 
110 | The frequency plot shows that word frequency decays in a well-defined way: once the first few words are set aside, the remaining words fall roughly along a straight line on log-log axes. This means that word frequency follows Zipf's law, i.e. the frequency $n_i$ of the $i$-th most frequent word satisfies:
111 | 
112 | $n_i\propto \frac{1}{i^{\alpha}}$ (3)
113 | 
114 | which is equivalent to
115 | 
116 | $\log n_i=-\alpha \log i + c$ (4)
117 | 
118 | where $\alpha$ is the exponent that characterizes the distribution and $c$ is a constant.
119 | 
120 | This tells us that modeling words through counting statistics and smoothing is not viable: such a model would greatly overestimate the frequency of the tail, that is, of the infrequent words.
121 | 
122 | ### 五. Reading Long-Sequence Data
123 | When training a model on long sequences, the data has to be split into short subsequences that the model can read conveniently; the model processes one minibatch of sequences of a predefined length at a time. The question to solve now is how to randomly generate the features and labels of such a minibatch from the raw text sequence.
124 | 
125 | Since a text sequence can be split at arbitrary positions, we can define a number of time steps $n$ and use it to cut the text sequence into subsequences of exactly $n$ time steps each; an arbitrary offset can be chosen to indicate the position where the splitting starts.
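126 | 
127 | A tiny sketch of this idea (the toy sequence corpus_demo and the value of $n$ are chosen purely for illustration): each offset yields a different set of length-$n$ subsequences.
128 | ```python
129 | corpus_demo = list(range(20))  # stand-in for a tokenized corpus
130 | n = 5                          # number of time steps per subsequence
131 | 
132 | for offset in range(n):
133 |     rest = corpus_demo[offset:]
134 |     pieces = [rest[i:i + n] for i in range(0, len(rest) - n + 1, n)]
135 |     print('offset', offset, ':', pieces)
136 | ```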
137 | 
138 | For example, with $n=5$:
139 | 
140 | ![fig3](../图片/语言模型/fig3.png)
141 | 
142 | Figure 3: the resulting subsequences (source: Dive into Deep Learning)
143 | 
144 | As Figure 3 shows, different offsets produce different subsequences. To guarantee randomness, a random offset is chosen as the starting position. Random sampling and sequential partitioning are implemented below to split the text sequence.
145 | 
146 | ### 六. Random Sampling
147 | In random sampling, each subsequence is a short sequence captured at an arbitrary position in the original long sequence. Two subsequences that end up adjacent after sampling are not necessarily adjacent in the original long sequence. For a language model, the features are the tokens observed so far, and the labels are the original sequence shifted by one token.
148 | ```python
149 | import random
150 | import torch
151 | 
152 | def seq_data_iter_random(corpus, batch_size, num_steps):  # num_steps is the length of each subsequence
153 |     # start from a random offset (anywhere in [0, num_steps-1]) to randomize the partition
154 |     corpus = corpus[random.randint(0, num_steps-1):]
155 |     # number of complete subsequences (the -1 leaves room for the shifted labels)
156 |     num_subseqs = (len(corpus)-1) // num_steps
157 |     # starting indices of the length-num_steps subsequences
158 |     indices = list(range(0, num_subseqs * num_steps, num_steps))
159 |     random.shuffle(indices)
160 | 
161 |     def data(pos):
162 |         # return the subsequence of length num_steps starting at pos
163 |         return corpus[pos: pos + num_steps]
164 | 
165 |     num_batches = num_subseqs // batch_size
166 |     for i in range(0, num_batches * batch_size, batch_size):
167 |         # the starting indices for this minibatch
168 |         iter_indices_per_batch = indices[i: i + batch_size]
169 |         X = [data(j) for j in iter_indices_per_batch]
170 |         Y = [data(j + 1) for j in iter_indices_per_batch]
171 |         # yield one iteration's features and labels
172 |         yield torch.tensor(X), torch.tensor(Y)
173 | ```
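174 | As a quick sanity check (a sketch; the fixed seed is only there to make the check reproducible, and starts is a name local to the sketch), one can verify that the subsequences drawn in one pass start at non-overlapping positions:
175 | ```python
176 | random.seed(0)  # fixed seed so this check is reproducible
177 | starts = []
178 | for X, Y in seq_data_iter_random(list(range(100)), batch_size=2, num_steps=5):
179 |     starts.extend(X[:, 0].tolist())  # first token of each sampled subsequence
180 | # start positions sit whole multiples of num_steps apart, so subsequences never overlap
181 | print(sorted(starts))
182 | ```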
183 | A random sequence of arbitrary length is now generated to check the effect of random sampling.
184 | 
185 | Let the sequence length be 34, the number of time steps 5, and batch_size 2:
186 | ```python
187 | seq = list(range(34))
188 | for X, Y in seq_data_iter_random(seq, 2, 5):
189 |     print("X:", X, "\nY:", Y)
190 | ```
191 | Output:
192 | 
193 | ![fig4](../图片/语言模型/fig4.png)
194 | 
195 | Figure 4: random sampling
196 | 
197 | Figure 4 shows that 3 minibatches of subsequences were generated; the features in each batch are randomly sampled, and two adjacent subsequences are generally not adjacent in the original long sequence.
198 | 
199 | ### 七. Sequential Partitioning
200 | With random sampling, two adjacent subsequences are not adjacent in the original sequence. To obtain subsequences that are also adjacent in the original sequence, sequential partitioning is needed. This method preserves the order of the split subsequences across minibatch iterations, hence the name.
201 | ```python
202 | def seq_data_iter_sequential(corpus, batch_size, num_steps):
203 |     # pick a random offset as the starting position
204 |     offset = random.randint(0, num_steps)
205 |     # number of tokens kept after the offset, trimmed to a multiple of batch_size
206 |     num_tokens = ((len(corpus) - offset - 1) // batch_size) * batch_size
207 |     # store the offset long sequence, plus the same sequence shifted by one token for the labels
208 |     Xs = torch.tensor(corpus[offset: offset + num_tokens])
209 |     Ys = torch.tensor(corpus[offset + 1: offset + 1 + num_tokens])
210 |     Xs, Ys = Xs.reshape(batch_size, -1), Ys.reshape(batch_size, -1)
211 |     # total number of minibatches produced
212 |     num_batches = Xs.shape[1] // num_steps
213 |     for i in range(0, num_batches * num_steps, num_steps):
214 |         # slice consecutive length-num_steps subsequences out of the stored sequence
215 |         X = Xs[:, i:i + num_steps]
216 |         Y = Ys[:, i:i + num_steps]
217 |         yield X, Y
218 | ```
219 | Unlike random sampling, when sequential partitioning extracts the next pair of subsequences, each row continues exactly where the previous minibatch left off: the next subsequence starts from the token that directly follows it in the original sequence. This guarantees that subsequences extracted in consecutive iterations are adjacent in the original sequence as well.
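220 | 
221 | This adjacency is easy to verify programmatically. The following sketch (the sequence length, batch_size, and num_steps are illustrative values) checks that, within each row, a minibatch starts right after the previous one ends:
222 | ```python
223 | prev_X = None
224 | for X, Y in seq_data_iter_sequential(list(range(34)), batch_size=2, num_steps=5):
225 |     if prev_X is not None:
226 |         # each row of X picks up right after the same row of the previous batch
227 |         assert torch.equal(X[:, 0], prev_X[:, -1] + 1)
228 |     prev_X = X
229 | print('consecutive minibatches are adjacent in the original sequence')
230 | ```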
231 | 
232 | Testing on the same kind of long sequence generated earlier:
233 | 
234 | ![fig5](../图片/语言模型/fig5.png)
235 | 
236 | Figure 5: sequential partitioning
237 | 
238 | As Figure 5 shows, three minibatches of subsequences are again generated, and any two adjacent subsequences are also adjacent in the original long sequence.
239 | 
240 | Next, the random sampling function and the sequential partitioning function are wrapped into a single class to make a data iterator:
241 | ```python
242 | class SeqDataLoader:
243 |     def __init__(self, batch_size, num_steps, use_random_iter, max_tokens):
244 |         if use_random_iter:
245 |             self.data_iter_fn = d2l.seq_data_iter_random
246 |         else:
247 |             self.data_iter_fn = d2l.seq_data_iter_sequential
248 |         self.corpus, self.vocab = d2l.load_corpus_time_machine(max_tokens)
249 |         self.batch_size, self.num_steps = batch_size, num_steps
250 | 
251 |     def __iter__(self):
252 |         return self.data_iter_fn(self.corpus, self.batch_size, self.num_steps)
253 | ```
254 | Finally, a load_data_time_machine() function is defined so that the data iterator (the sampler) and the vocabulary can be returned together:
255 | ```python
256 | def load_data_time_machine(batch_size, num_steps, use_random_iter=False, max_tokens=10000):
257 |     data_iter = SeqDataLoader(batch_size, num_steps, use_random_iter, max_tokens)
258 |     return data_iter, data_iter.vocab
259 | ```
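260 | 
261 | As a closing sanity check, here is a minimal usage sketch of the finished loader (the batch_size and num_steps values are illustrative):
262 | ```python
263 | data_iter, vocab = load_data_time_machine(batch_size=32, num_steps=35)
264 | for X, Y in data_iter:
265 |     print('X:', X.shape, 'Y:', Y.shape)  # both are (batch_size, num_steps)
266 |     break
267 | print('vocab size:', len(vocab))
268 | ```
--------------------------------------------------------------------------------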