├── .gitignore ├── Llama-2 ├── Part 1 │ └── BabyLLaMA.ipynb ├── Part 2 │ └── BabyLLaMA.ipynb └── Part 3 │ └── BabyLLaMA.ipynb ├── Llama-3 ├── Part 1 │ └── Downcycling.ipynb └── Part 2 │ ├── Config │ └── train-llama-3-6B.yml │ ├── Downcycling_Comparision.ipynb │ ├── FineWeb10B.ipynb │ └── assets │ ├── Comparision_of_Model_Scores.png │ ├── Experiment Canvas.png │ ├── Llama-3-8B-vs-6B-v0.png │ ├── Training Loss.png │ ├── downcycling.png │ ├── llama-3-6B icon.jpeg │ ├── model_scores.png │ └── model_scores_llama_3_8B.png └── README.md /.gitignore: -------------------------------------------------------------------------------- 1 | .DS_Store 2 | -------------------------------------------------------------------------------- /Llama-2/Part 1/BabyLLaMA.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "\n", 8 | " \"Open\n", 9 | "" 10 | ] 11 | }, 12 | { 13 | "cell_type": "markdown", 14 | "metadata": { 15 | "id": "n4oOcpvm2Bt6" 16 | }, 17 | "source": [ 18 | " # BabyLLaMA\n", 19 | "\n", 20 | "Coding the LLaMA-2 research paper from scratch to create models with sizes 100M, 250M and 500M params." 21 | ] 22 | }, 23 | { 24 | "cell_type": "markdown", 25 | "metadata": { 26 | "id": "NrmuruVN5_2F" 27 | }, 28 | "source": [ 29 | "## Model Arch\n", 30 | "\n", 31 | "Decoder only: composed of `n_layers` identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple position-wise fully connected FFN. We employ a residual connection around each of the two sub-layers, followed by layer normalization. That is:\n", 32 | "LayerNorm(x + Sublayer(x))\n", 33 | " -- A Vaswani et al., 2017." 
34 | ] 35 | }, 36 | { 37 | "cell_type": "code", 38 | "execution_count": null, 39 | "metadata": { 40 | "id": "bZjKLwgz6os2" 41 | }, 42 | "outputs": [], 43 | "source": [ 44 | "import torch\n", 45 | "from torch import nn\n", 46 | "import torch.nn.functional as F\n", 47 | "from math import sqrt" 48 | ] 49 | }, 50 | { 51 | "cell_type": "code", 52 | "execution_count": null, 53 | "metadata": { 54 | "id": "xf2Oy7eGLr8f" 55 | }, 56 | "outputs": [], 57 | "source": [ 58 | "n_layers = 6 # 22 Tiny LLaMA\n", 59 | "n_heads = 6 # 32 Tiny LLaMA\n", 60 | "d_model = 768 # 2048 Tiny LLaMA\n", 61 | "intermediate_dim = d_model * 4" 62 | ] 63 | }, 64 | { 65 | "cell_type": "markdown", 66 | "metadata": { 67 | "id": "tmh9Ms0WKW6B" 68 | }, 69 | "source": [ 70 | "### MHA\n", 71 | "" 72 | ] 73 | }, 74 | { 75 | "cell_type": "code", 76 | "execution_count": null, 77 | "metadata": { 78 | "id": "yR6sUtOaUZ-r" 79 | }, 80 | "outputs": [], 81 | "source": [ 82 | "# Generate random input data\n", 83 | "sequence_length = 10 # number of tokens\n", 84 | "batch_size = 5\n", 85 | "input_data = torch.rand((batch_size, sequence_length, d_model)) # [bs, sequence_length, d_model]" 86 | ] 87 | }, 88 | { 89 | "cell_type": "code", 90 | "execution_count": null, 91 | "metadata": { 92 | "colab": { 93 | "base_uri": "https://localhost:8080/" 94 | }, 95 | "id": "Q0P9tmSWVMrz", 96 | "outputId": "debef009-d3d9-41dd-b7bc-dd1a644b15f0" 97 | }, 98 | "outputs": [ 99 | { 100 | "data": { 101 | "text/plain": [ 102 | "torch.Size([5, 10, 768])" 103 | ] 104 | }, 105 | "execution_count": 99, 106 | "metadata": {}, 107 | "output_type": "execute_result" 108 | } 109 | ], 110 | "source": [ 111 | "input_data.shape" 112 | ] 113 | }, 114 | { 115 | "cell_type": "markdown", 116 | "metadata": { 117 | "id": "X6fJhFVGZQDG" 118 | }, 119 | "source": [ 120 | "- MQA\n", 121 | "- GQA" 122 | ] 123 | }, 124 | { 125 | "cell_type": "code", 126 | "execution_count": null, 127 | "metadata": { 128 | "id": "L6BA3bdJJTVu" 129 | }, 130 | "outputs": [], 131 | 
"source": [ 132 | "class AttentionHead(nn.Module):\n", 133 | " def __init__(self, embed_dim, hidden_dim):\n", 134 | " super(AttentionHead, self).__init__()\n", 135 | " self.q = nn.Linear(embed_dim, hidden_dim)\n", 136 | " self.k = nn.Linear(embed_dim, hidden_dim)\n", 137 | " self.v = nn.Linear(embed_dim, hidden_dim)\n", 138 | "\n", 139 | " def scaled_dot_product_attention(self, q, k, v, mask = None):\n", 140 | " dim_k = q.size(-1)\n", 141 | " scores = torch.bmm(q, k.transpose(1, 2)) / sqrt(dim_k) # k: [bs, seq_len, head_dim] -> k.T: [bs, head_dim, seq_len]\n", 142 | " if mask is not None:\n", 143 | " scores = torch.masked_fill(scores, mask == 0, -torch.inf)\n", 144 | " weights = F.softmax(scores, dim=-1)\n", 145 | " return torch.bmm(weights, v)\n", 146 | "\n", 147 | " def forward(self, hidden_state, mask=None):\n", 148 | " output = self.scaled_dot_product_attention(\n", 149 | " self.q(hidden_state), self.k(hidden_state), self.v(hidden_state), mask=mask\n", 150 | " )\n", 151 | " return output" 152 | ] 153 | }, 154 | { 155 | "cell_type": "code", 156 | "execution_count": null, 157 | "metadata": { 158 | "colab": { 159 | "base_uri": "https://localhost:8080/" 160 | }, 161 | "id": "yR7l4QQIW_1S", 162 | "outputId": "7ebd7dd2-bff2-4167-d5a5-6a53c85e0d72" 163 | }, 164 | "outputs": [ 165 | { 166 | "data": { 167 | "text/plain": [ 168 | "torch.Size([5, 10, 128])" 169 | ] 170 | }, 171 | "execution_count": 8, 172 | "metadata": {}, 173 | "output_type": "execute_result" 174 | } 175 | ], 176 | "source": [ 177 | "attn = AttentionHead(d_model, d_model//n_heads)\n", 178 | "attn(input_data).shape" 179 | ] 180 | }, 181 | { 182 | "cell_type": "code", 183 | "execution_count": null, 184 | "metadata": { 185 | "id": "9D3EVPKaYXRJ" 186 | }, 187 | "outputs": [], 188 | "source": [ 189 | "class MHA(nn.Module):\n", 190 | " def __init__(self, n_heads, hidden_dim):\n", 191 | " super(MHA, self).__init__()\n", 192 | " embed_dim = hidden_dim\n", 193 | " head_dim = hidden_dim // n_heads\n", 194 | " 
self.heads = nn.ModuleList(\n", 195 | " [AttentionHead(embed_dim, head_dim) for _ in range(n_heads)]\n", 196 | " )\n", 197 | " self.out_proj = nn.Linear(embed_dim, embed_dim)\n", 198 | "\n", 199 | " def forward(self, hidden_state):\n", 200 | " x = torch.cat([h(hidden_state) for h in self.heads], dim=-1)\n", 201 | " return self.out_proj(x)" 202 | ] 203 | }, 204 | { 205 | "cell_type": "code", 206 | "execution_count": null, 207 | "metadata": { 208 | "colab": { 209 | "base_uri": "https://localhost:8080/" 210 | }, 211 | "id": "DmeF6UoHe_Vy", 212 | "outputId": "bd59b7a8-4e7f-41f2-82b6-79970a9e36ad" 213 | }, 214 | "outputs": [ 215 | { 216 | "data": { 217 | "text/plain": [ 218 | "torch.Size([5, 10, 768])" 219 | ] 220 | }, 221 | "execution_count": 10, 222 | "metadata": {}, 223 | "output_type": "execute_result" 224 | } 225 | ], 226 | "source": [ 227 | "mha = MHA(n_heads, d_model)\n", 228 | "mha(input_data).shape" 229 | ] 230 | }, 231 | { 232 | "cell_type": "code", 233 | "execution_count": null, 234 | "metadata": { 235 | "colab": { 236 | "base_uri": "https://localhost:8080/" 237 | }, 238 | "id": "jnqN13SHOMQF", 239 | "outputId": "c43aecaf-fac5-456d-a759-37df602cc6a3" 240 | }, 241 | "outputs": [ 242 | { 243 | "data": { 244 | "text/plain": [ 245 | "torch.Size([5, 10, 768])" 246 | ] 247 | }, 248 | "execution_count": 11, 249 | "metadata": {}, 250 | "output_type": "execute_result" 251 | } 252 | ], 253 | "source": [ 254 | "input_data.shape" 255 | ] 256 | }, 257 | { 258 | "cell_type": "code", 259 | "execution_count": null, 260 | "metadata": { 261 | "colab": { 262 | "base_uri": "https://localhost:8080/" 263 | }, 264 | "id": "xj0XTxleorjE", 265 | "outputId": "f7e134d7-513b-4e50-f036-df84ef878274" 266 | }, 267 | "outputs": [ 268 | { 269 | "data": { 270 | "text/plain": [ 271 | "MHA(\n", 272 | " (heads): ModuleList(\n", 273 | " (0-5): 6 x AttentionHead(\n", 274 | " (q): Linear(in_features=768, out_features=128, bias=True)\n", 275 | " (k): Linear(in_features=768, out_features=128, 
bias=True)\n", 276 | " (v): Linear(in_features=768, out_features=128, bias=True)\n", 277 | " )\n", 278 | " )\n", 279 | " (out_proj): Linear(in_features=768, out_features=768, bias=True)\n", 280 | ")" 281 | ] 282 | }, 283 | "execution_count": 12, 284 | "metadata": {}, 285 | "output_type": "execute_result" 286 | } 287 | ], 288 | "source": [ 289 | "mha" 290 | ] 291 | }, 292 | { 293 | "cell_type": "code", 294 | "execution_count": null, 295 | "metadata": { 296 | "id": "rQfC90HwjtgN" 297 | }, 298 | "outputs": [], 299 | "source": [ 300 | "class LLaMAMLP(nn.Module):\n", 301 | " def __init__(self, hidden_dim, intermediate_dim): # in MLP: intermediate_dim= 4 * hidden_dim\n", 302 | " super(LLaMAMLP, self).__init__()\n", 303 | " self.linear_1 = nn.Linear(hidden_dim, intermediate_dim)\n", 304 | " self.linear_2 = nn.Linear(hidden_dim, intermediate_dim) # Original: intermediate -> hidden.\n", 305 | " self.activation_fn = nn.SiLU()\n", 306 | " self.out_proj = nn.Linear(intermediate_dim, hidden_dim) # Original: dropout\n", 307 | "\n", 308 | "\n", 309 | " def forward(self, hidden_state):\n", 310 | " x_fc_1 = self.linear_1(hidden_state)\n", 311 | " x_fc_2 = self.linear_2(hidden_state)\n", 312 | " x = self.activation_fn(x_fc_1) * x_fc_2\n", 313 | " return self.out_proj(x)" 314 | ] 315 | }, 316 | { 317 | "cell_type": "code", 318 | "execution_count": null, 319 | "metadata": { 320 | "colab": { 321 | "base_uri": "https://localhost:8080/" 322 | }, 323 | "id": "HifM3AeAq44l", 324 | "outputId": "b7a981f1-a0e7-44a6-bf51-ee44a71d9311" 325 | }, 326 | "outputs": [ 327 | { 328 | "data": { 329 | "text/plain": [ 330 | "torch.Size([5, 10, 768])" 331 | ] 332 | }, 333 | "execution_count": 18, 334 | "metadata": {}, 335 | "output_type": "execute_result" 336 | } 337 | ], 338 | "source": [ 339 | "mlp = LLaMAMLP(d_model, intermediate_dim)\n", 340 | "mlp(input_data).shape" 341 | ] 342 | }, 343 | { 344 | "cell_type": "code", 345 | "execution_count": null, 346 | "metadata": { 347 | "id": "DDnXHH6bP4a3" 348 | 
}, 349 | "outputs": [], 350 | "source": [ 351 | "class Block(nn.Module):\n", 352 | " def __init__(self, n_heads, hidden_dim, intermediate_dim):\n", 353 | " super(Block, self).__init__()\n", 354 | " self.n_heads = n_heads\n", 355 | " self.hidden_dim = hidden_dim\n", 356 | " self.intermediate_dim = intermediate_dim\n", 357 | " self.mha = MHA(n_heads, hidden_dim=hidden_dim)\n", 358 | " self.layer_norm = nn.LayerNorm(hidden_dim)\n", 359 | " self.mlp = LLaMAMLP(hidden_dim, intermediate_dim)\n", 360 | "\n", 361 | " def forward(self, hidden_state, mask=None):\n", 362 | " x = self.mha(hidden_state)\n", 363 | " x = self.layer_norm(hidden_state) + x\n", 364 | " x_fc = self.mlp(x)\n", 365 | " x += x_fc\n", 366 | " return x\n" 367 | ] 368 | }, 369 | { 370 | "cell_type": "code", 371 | "execution_count": null, 372 | "metadata": { 373 | "colab": { 374 | "base_uri": "https://localhost:8080/" 375 | }, 376 | "id": "S5UplSTmSDpt", 377 | "outputId": "12f728e4-056d-4bdf-f1eb-f06c5982831c" 378 | }, 379 | "outputs": [ 380 | { 381 | "data": { 382 | "text/plain": [ 383 | "torch.Size([5, 10, 768])" 384 | ] 385 | }, 386 | "execution_count": 20, 387 | "metadata": {}, 388 | "output_type": "execute_result" 389 | } 390 | ], 391 | "source": [ 392 | "block = Block(n_heads, d_model, intermediate_dim)\n", 393 | "block(input_data).shape" 394 | ] 395 | }, 396 | { 397 | "cell_type": "code", 398 | "execution_count": null, 399 | "metadata": { 400 | "id": "GcngjpHCSUnh" 401 | }, 402 | "outputs": [], 403 | "source": [ 404 | "class babyLLaMA(nn.Module):\n", 405 | " def __init__(self, max_seq_len, vocab_size, n_layers, n_heads, hidden_dim, intermediate_dim):\n", 406 | " super(babyLLaMA, self).__init__()\n", 407 | " self.emb = nn.Embedding(vocab_size, hidden_dim)\n", 408 | " self.pos = nn.Embedding(max_seq_len, hidden_dim)\n", 409 | " self.blocks = nn.ModuleList(\n", 410 | " [Block(n_heads, hidden_dim, intermediate_dim) for _ in range(n_layers)]\n", 411 | " )\n", 412 | " self.out_proj = nn.Linear(hidden_dim, 
vocab_size)\n", 413 | "\n", 414 | " def forward(self, hidden_state):\n", 415 | " emb = self.emb(hidden_state)\n", 416 | " seq_len = hidden_state.size(1)\n", 417 | " positions = torch.arange(seq_len, dtype=torch.long).unsqueeze(0)\n", 418 | " pos = self.pos(positions)\n", 419 | " x = emb + pos\n", 420 | " for b in self.blocks:\n", 421 | " x = b(x)\n", 422 | "\n", 423 | " x = self.out_proj(x)\n", 424 | " return F.softmax(x, dim=-1)\n" 425 | ] 426 | }, 427 | { 428 | "cell_type": "code", 429 | "execution_count": null, 430 | "metadata": { 431 | "id": "HZqvBWtmVxUk" 432 | }, 433 | "outputs": [], 434 | "source": [ 435 | "llm = babyLLaMA(d_model, 32000, 22, n_heads, d_model, intermediate_dim)\n", 436 | "input_ids = torch.randint(1, 32000, (batch_size, sequence_length))" 437 | ] 438 | }, 439 | { 440 | "cell_type": "code", 441 | "execution_count": null, 442 | "metadata": { 443 | "colab": { 444 | "base_uri": "https://localhost:8080/" 445 | }, 446 | "id": "Lk9xlnKeYNuY", 447 | "outputId": "6b45abed-323a-4ea7-f105-0a9503ba1b85" 448 | }, 449 | "outputs": [ 450 | { 451 | "data": { 452 | "text/plain": [ 453 | "torch.Size([5, 10, 32000])" 454 | ] 455 | }, 456 | "execution_count": 43, 457 | "metadata": {}, 458 | "output_type": "execute_result" 459 | } 460 | ], 461 | "source": [ 462 | "llm(input_ids).shape" 463 | ] 464 | }, 465 | { 466 | "cell_type": "code", 467 | "execution_count": null, 468 | "metadata": { 469 | "colab": { 470 | "base_uri": "https://localhost:8080/" 471 | }, 472 | "id": "geZxTDtrYica", 473 | "outputId": "23823184-49d3-4140-8bc9-af08580d9f7d" 474 | }, 475 | "outputs": [ 476 | { 477 | "data": { 478 | "text/plain": [ 479 | "babyLLaMA(\n", 480 | " (emb): Embedding(32000, 768)\n", 481 | " (pos): Embedding(768, 768)\n", 482 | " (blocks): ModuleList(\n", 483 | " (0-5): 6 x Block(\n", 484 | " (mha): MHA(\n", 485 | " (heads): ModuleList(\n", 486 | " (0-5): 6 x AttentionHead(\n", 487 | " (q): Linear(in_features=768, out_features=128, bias=True)\n", 488 | " (k): 
Linear(in_features=768, out_features=128, bias=True)\n", 489 | " (v): Linear(in_features=768, out_features=128, bias=True)\n", 490 | " )\n", 491 | " )\n", 492 | " (out_proj): Linear(in_features=768, out_features=768, bias=True)\n", 493 | " )\n", 494 | " (layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)\n", 495 | " (mlp): LLaMAMLP(\n", 496 | " (linear_1): Linear(in_features=768, out_features=3072, bias=True)\n", 497 | " (linear_2): Linear(in_features=768, out_features=3072, bias=True)\n", 498 | " (activation_fn): SiLU()\n", 499 | " (out_proj): Linear(in_features=3072, out_features=768, bias=True)\n", 500 | " )\n", 501 | " )\n", 502 | " )\n", 503 | " (out_proj): Linear(in_features=768, out_features=32000, bias=True)\n", 504 | ")" 505 | ] 506 | }, 507 | "execution_count": 38, 508 | "metadata": {}, 509 | "output_type": "execute_result" 510 | } 511 | ], 512 | "source": [ 513 | "llm" 514 | ] 515 | }, 516 | { 517 | "cell_type": "code", 518 | "execution_count": null, 519 | "metadata": { 520 | "id": "3jNqAo_vYlHT" 521 | }, 522 | "outputs": [], 523 | "source": [ 524 | "def count_parameters(model):\n", 525 | " return sum(p.numel() for p in model.parameters() if p.requires_grad)" 526 | ] 527 | }, 528 | { 529 | "cell_type": "code", 530 | "execution_count": null, 531 | "metadata": { 532 | "colab": { 533 | "base_uri": "https://localhost:8080/" 534 | }, 535 | "id": "FUnuoPINY9VS", 536 | "outputId": "cd62a0e3-8f17-43cf-9f60-cb8122eb1913" 537 | }, 538 | "outputs": [ 539 | { 540 | "data": { 541 | "text/plain": [ 542 | "257645312" 543 | ] 544 | }, 545 | "execution_count": 45, 546 | "metadata": {}, 547 | "output_type": "execute_result" 548 | } 549 | ], 550 | "source": [ 551 | "count_parameters(llm)" 552 | ] 553 | }, 554 | { 555 | "cell_type": "code", 556 | "execution_count": null, 557 | "metadata": { 558 | "id": "MJSfGx2I2Io6" 559 | }, 560 | "outputs": [], 561 | "source": [] 562 | } 563 | ], 564 | "metadata": { 565 | "colab": { 566 | "provenance": [] 567 | }, 568 | 
"kernelspec": { 569 | "display_name": "Python 3", 570 | "name": "python3" 571 | }, 572 | "language_info": { 573 | "name": "python" 574 | } 575 | }, 576 | "nbformat": 4, 577 | "nbformat_minor": 0 578 | } 579 | -------------------------------------------------------------------------------- /Llama-2/Part 2/BabyLLaMA.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "markdown", 5 | "metadata": {}, 6 | "source": [ 7 | "\n", 8 | " \"Open\n", 9 | "" 10 | ] 11 | }, 12 | { 13 | "cell_type": "markdown", 14 | "metadata": { 15 | "id": "n4oOcpvm2Bt6" 16 | }, 17 | "source": [ 18 | " # BabyLLaMA\n", 19 | "\n", 20 | "Coding the LLaMA-2 research paper from scratch to create models with sizes 100M, 250M and 500M params." 21 | ] 22 | }, 23 | { 24 | "cell_type": "markdown", 25 | "metadata": { 26 | "id": "NrmuruVN5_2F" 27 | }, 28 | "source": [ 29 | "## Model Arch\n", 30 | "\n", 31 | "Decoder only: composed of `n_layers` identical layers. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple position-wise fully connected FFN. We employ a residual connection around each of the two sub-layers, followed by layer normalization. That is:\n", 32 | "LayerNorm(x + Sublayer(x))\n", 33 | " -- A Vaswani et al., 2017." 
34 | ] 35 | }, 36 | { 37 | "cell_type": "code", 38 | "execution_count": null, 39 | "metadata": { 40 | "id": "3jNqAo_vYlHT" 41 | }, 42 | "outputs": [], 43 | "source": [ 44 | "def count_parameters(model):\n", 45 | " return sum(p.numel() for p in model.parameters() if p.requires_grad)" 46 | ] 47 | }, 48 | { 49 | "cell_type": "code", 50 | "execution_count": null, 51 | "metadata": { 52 | "id": "bZjKLwgz6os2" 53 | }, 54 | "outputs": [], 55 | "source": [ 56 | "import torch\n", 57 | "from torch import nn\n", 58 | "import torch.nn.functional as F\n", 59 | "from math import sqrt" 60 | ] 61 | }, 62 | { 63 | "cell_type": "code", 64 | "execution_count": null, 65 | "metadata": { 66 | "id": "xf2Oy7eGLr8f" 67 | }, 68 | "outputs": [], 69 | "source": [ 70 | "n_layers = 6 # 22 Tiny LLaMA\n", 71 | "n_heads = 6 # 32 Tiny LLaMA\n", 72 | "d_model = 768 # 2048 Tiny LLaMA\n", 73 | "intermediate_dim = d_model * 4" 74 | ] 75 | }, 76 | { 77 | "cell_type": "markdown", 78 | "metadata": { 79 | "id": "tmh9Ms0WKW6B" 80 | }, 81 | "source": [ 82 | "### MHA\n", 83 | "" 84 | ] 85 | }, 86 | { 87 | "cell_type": "code", 88 | "execution_count": null, 89 | "metadata": { 90 | "id": "yR6sUtOaUZ-r" 91 | }, 92 | "outputs": [], 93 | "source": [ 94 | "# Generate random input data\n", 95 | "sequence_length = 10 # number of tokens\n", 96 | "batch_size = 5\n", 97 | "input_data = torch.rand((batch_size, sequence_length, d_model)) # [bs, sequence_length, d_model]" 98 | ] 99 | }, 100 | { 101 | "cell_type": "code", 102 | "execution_count": null, 103 | "metadata": { 104 | "colab": { 105 | "base_uri": "https://localhost:8080/" 106 | }, 107 | "id": "Q0P9tmSWVMrz", 108 | "outputId": "debef009-d3d9-41dd-b7bc-dd1a644b15f0" 109 | }, 110 | "outputs": [ 111 | { 112 | "data": { 113 | "text/plain": [ 114 | "torch.Size([5, 10, 768])" 115 | ] 116 | }, 117 | "execution_count": 99, 118 | "metadata": {}, 119 | "output_type": "execute_result" 120 | } 121 | ], 122 | "source": [ 123 | "input_data.shape" 124 | ] 125 | }, 126 | { 127 | 
"cell_type": "markdown", 128 | "metadata": { 129 | "id": "X6fJhFVGZQDG" 130 | }, 131 | "source": [ 132 | "- MQA\n", 133 | "- GQA" 134 | ] 135 | }, 136 | { 137 | "cell_type": "code", 138 | "execution_count": null, 139 | "metadata": { 140 | "id": "L6BA3bdJJTVu" 141 | }, 142 | "outputs": [], 143 | "source": [ 144 | "class AttentionHead(nn.Module):\n", 145 | " def __init__(self, embed_dim, hidden_dim):\n", 146 | " super(AttentionHead, self).__init__()\n", 147 | " self.q = nn.Linear(embed_dim, hidden_dim)\n", 148 | " self.k = nn.Linear(embed_dim, hidden_dim)\n", 149 | " self.v = nn.Linear(embed_dim, hidden_dim)\n", 150 | "\n", 151 | " def scaled_dot_product_attention(self, q, k, v, mask = None):\n", 152 | " dim_k = q.size(-1)\n", 153 | " scores = torch.bmm(q, k.transpose(1, 2)) / sqrt(dim_k) # k: [bs, seq_len, head_dim] -> k.T: [bs, head_dim, seq_len]\n", 154 | " if mask is not None:\n", 155 | " scores = torch.masked_fill(scores, mask == 0, -torch.inf)\n", 156 | " weights = F.softmax(scores, dim=-1)\n", 157 | " return torch.bmm(weights, v)\n", 158 | "\n", 159 | " def forward(self, hidden_state, mask = None):\n", 160 | " output = self.scaled_dot_product_attention(\n", 161 | " self.q(hidden_state), self.k(hidden_state), self.v(hidden_state), mask=mask\n", 162 | " )\n", 163 | " return output" 164 | ] 165 | }, 166 | { 167 | "cell_type": "code", 168 | "execution_count": null, 169 | "metadata": { 170 | "colab": { 171 | "base_uri": "https://localhost:8080/" 172 | }, 173 | "id": "yR7l4QQIW_1S", 174 | "outputId": "7ebd7dd2-bff2-4167-d5a5-6a53c85e0d72" 175 | }, 176 | "outputs": [ 177 | { 178 | "data": { 179 | "text/plain": [ 180 | "torch.Size([5, 10, 128])" 181 | ] 182 | }, 183 | "execution_count": 8, 184 | "metadata": {}, 185 | "output_type": "execute_result" 186 | } 187 | ], 188 | "source": [ 189 | "attn = AttentionHead(d_model, d_model//n_heads)\n", 190 | "attn(input_data).shape" 191 | ] 192 | }, 193 | { 194 | "cell_type": "code", 195 | "execution_count": null, 196 | 
"metadata": { 197 | "id": "xdkLltN5kMlb" 198 | }, 199 | "outputs": [], 200 | "source": [] 201 | }, 202 | { 203 | "cell_type": "markdown", 204 | "metadata": { 205 | "id": "xzQ-tT98j4v4" 206 | }, 207 | "source": [ 208 | "## MHA vs GQA vs MQA\n", 209 | "![gqa.webp]()" 210 | ] 211 | }, 212 | { 213 | "cell_type": "code", 214 | "execution_count": null, 215 | "metadata": { 216 | "id": "9D3EVPKaYXRJ" 217 | }, 218 | "outputs": [], 219 | "source": [ 220 | "class MultiHeadAttention(nn.Module):\n", 221 | " def __init__(self, n_heads, hidden_dim):\n", 222 | " super(MultiHeadAttention, self).__init__()\n", 223 | " embed_dim = hidden_dim\n", 224 | " head_dim = hidden_dim // n_heads\n", 225 | " self.heads = nn.ModuleList(\n", 226 | " [AttentionHead(embed_dim, head_dim) for _ in range(n_heads)]\n", 227 | " )\n", 228 | " self.out_proj = nn.Linear(embed_dim, embed_dim)\n", 229 | "\n", 230 | " def forward(self, hidden_state):\n", 231 | " x = torch.cat([h(hidden_state) for h in self.heads], dim=-1)\n", 232 | " return self.out_proj(x)\n", 233 | "\n", 234 | "\n", 235 | "class MultiQueryAttention(nn.Module):\n", 236 | " def __init__(self, n_q_heads, hidden_dim):\n", 237 | " super(MultiQueryAttention, self).__init__()\n", 238 | " head_dim = hidden_dim // n_heads\n", 239 | " self.queries = nn.ModuleList(\n", 240 | " [nn.Linear(hidden_dim, head_dim) for _ in range(n_q_heads)]\n", 241 | " )\n", 242 | " self.key = nn.Linear(hidden_dim, head_dim)\n", 243 | " self.value = nn.Linear(hidden_dim, head_dim)\n", 244 | " self.out_proj = nn.Linear(n_q_heads * head_dim, hidden_dim)\n", 245 | "\n", 246 | " def scaled_dot_product_attention(self, q, k, v, mask = None):\n", 247 | " dim_k = q.size(-1)\n", 248 | " scores = torch.bmm(q, k.transpose(1, 2)) / sqrt(dim_k) # k: [bs, seq_len, head_dim] -> k.T: [bs, head_dim, seq_len]\n", 249 | " if mask is not None:\n", 250 | " scores = torch.masked_fill(scores, mask == 0, -torch.inf)\n", 251 | " weights = F.softmax(scores, dim=-1)\n", 252 | " return 
torch.bmm(weights, v)\n", 253 | "\n", 254 | " def forward(self, hidden_state):\n", 255 | " k = self.key(hidden_state)\n", 256 | " v = self.value(hidden_state)\n", 257 | " x = torch.cat([\n", 258 | " self.scaled_dot_product_attention(query(hidden_state), k, v)\n", 259 | " for query in self.queries\n", 260 | " ], dim=-1)\n", 261 | " return self.out_proj(x)\n", 262 | "\n", 263 | "\n", 264 | "class GroupedQueryAttention(nn.Module):\n", 265 | " def __init__(self, n_q_heads_per_group, n_k_v_heads, hidden_dim):\n", 266 | " super(GroupedQueryAttention, self).__init__()\n", 267 | " self.n_k_v_heads = n_k_v_heads\n", 268 | " self.n_q_heads_per_group = n_q_heads_per_group\n", 269 | " self.hidden_dim = hidden_dim\n", 270 | " self.grouped = nn.ModuleList([\n", 271 | " MultiQueryAttention(\n", 272 | " n_q_heads=n_q_heads_per_group, hidden_dim=hidden_dim\n", 273 | " )\n", 274 | " for _ in range(n_k_v_heads)\n", 275 | " ])\n", 276 | " self.proj = nn.Linear(in_features=hidden_dim * n_k_v_heads,\n", 277 | " out_features=hidden_dim, bias=False)\n", 278 | "\n", 279 | " def forward(self, hidden_state, mask=None):\n", 280 | " Z_s = torch.cat([head(hidden_state) for head in self.grouped], dim=-1)\n", 281 | " Z = self.proj(Z_s)\n", 282 | " return Z" 283 | ] 284 | }, 285 | { 286 | "cell_type": "code", 287 | "execution_count": null, 288 | "metadata": { 289 | "colab": { 290 | "base_uri": "https://localhost:8080/" 291 | }, 292 | "id": "DmeF6UoHe_Vy", 293 | "outputId": "059a3f41-23f0-4cc0-93e2-069b634091be" 294 | }, 295 | "outputs": [ 296 | { 297 | "data": { 298 | "text/plain": [ 299 | "torch.Size([5, 10, 768])" 300 | ] 301 | }, 302 | "execution_count": 23, 303 | "metadata": {}, 304 | "output_type": "execute_result" 305 | } 306 | ], 307 | "source": [ 308 | "mha = MultiHeadAttention(n_heads, d_model)\n", 309 | "mha(input_data).shape" 310 | ] 311 | }, 312 | { 313 | "cell_type": "code", 314 | "execution_count": null, 315 | "metadata": { 316 | "colab": { 317 | "base_uri": "https://localhost:8080/" 
318 | }, 319 | "id": "kHzHnc6QuzDx", 320 | "outputId": "3777829b-8b5a-49b4-e56f-6c6c40ace32a" 321 | }, 322 | "outputs": [ 323 | { 324 | "data": { 325 | "text/plain": [ 326 | "2362368" 327 | ] 328 | }, 329 | "execution_count": 86, 330 | "metadata": {}, 331 | "output_type": "execute_result" 332 | } 333 | ], 334 | "source": [ 335 | "count_parameters(mha)" 336 | ] 337 | }, 338 | { 339 | "cell_type": "code", 340 | "execution_count": null, 341 | "metadata": { 342 | "colab": { 343 | "base_uri": "https://localhost:8080/" 344 | }, 345 | "id": "UxWS1kkIkI4C", 346 | "outputId": "64757630-b346-4975-8cd6-0a1b9e346a00" 347 | }, 348 | "outputs": [ 349 | { 350 | "data": { 351 | "text/plain": [ 352 | "torch.Size([5, 10, 768])" 353 | ] 354 | }, 355 | "execution_count": 85, 356 | "metadata": {}, 357 | "output_type": "execute_result" 358 | } 359 | ], 360 | "source": [ 361 | "mqa = MultiQueryAttention(n_heads, d_model)\n", 362 | "mqa(input_data).shape" 363 | ] 364 | }, 365 | { 366 | "cell_type": "code", 367 | "execution_count": null, 368 | "metadata": { 369 | "colab": { 370 | "base_uri": "https://localhost:8080/" 371 | }, 372 | "id": "jnqN13SHOMQF", 373 | "outputId": "86d62f49-c22a-4d75-edb5-abd4ff8839c5" 374 | }, 375 | "outputs": [ 376 | { 377 | "data": { 378 | "text/plain": [ 379 | "1378048" 380 | ] 381 | }, 382 | "execution_count": 84, 383 | "metadata": {}, 384 | "output_type": "execute_result" 385 | } 386 | ], 387 | "source": [ 388 | "count_parameters(mqa)" 389 | ] 390 | }, 391 | { 392 | "cell_type": "code", 393 | "execution_count": null, 394 | "metadata": { 395 | "colab": { 396 | "base_uri": "https://localhost:8080/" 397 | }, 398 | "id": "awEPrfZumBhd", 399 | "outputId": "90d9cba3-fe1c-46c1-b28e-756c7292f1b2" 400 | }, 401 | "outputs": [ 402 | { 403 | "data": { 404 | "text/plain": [ 405 | "torch.Size([5, 10, 768])" 406 | ] 407 | }, 408 | "execution_count": 103, 409 | "metadata": {}, 410 | "output_type": "execute_result" 411 | } 412 | ], 413 | "source": [ 414 | "num_q_heads = 
n_heads\n", 415 | "n_k_v_heads=2\n", 416 | "n_q_heads_per_group = num_q_heads // n_k_v_heads\n", 417 | "\n", 418 | "gqa = GroupedQueryAttention(n_q_heads_per_group=n_q_heads_per_group, n_k_v_heads=n_k_v_heads, hidden_dim=d_model)\n", 419 | "gqa(input_data).shape" 420 | ] 421 | }, 422 | { 423 | "cell_type": "code", 424 | "execution_count": null, 425 | "metadata": { 426 | "colab": { 427 | "base_uri": "https://localhost:8080/" 428 | }, 429 | "id": "7YVIJen6urJW", 430 | "outputId": "5ea7e561-8deb-4757-823c-00785dcfd9ba" 431 | }, 432 | "outputs": [ 433 | { 434 | "data": { 435 | "text/plain": [ 436 | "2755328" 437 | ] 438 | }, 439 | "execution_count": 104, 440 | "metadata": {}, 441 | "output_type": "execute_result" 442 | } 443 | ], 444 | "source": [ 445 | "count_parameters(gqa)" 446 | ] 447 | }, 448 | { 449 | "cell_type": "code", 450 | "execution_count": null, 451 | "metadata": { 452 | "colab": { 453 | "base_uri": "https://localhost:8080/" 454 | }, 455 | "id": "xj0XTxleorjE", 456 | "outputId": "f7e134d7-513b-4e50-f036-df84ef878274" 457 | }, 458 | "outputs": [ 459 | { 460 | "data": { 461 | "text/plain": [ 462 | "MHA(\n", 463 | " (heads): ModuleList(\n", 464 | " (0-5): 6 x AttentionHead(\n", 465 | " (q): Linear(in_features=768, out_features=128, bias=True)\n", 466 | " (k): Linear(in_features=768, out_features=128, bias=True)\n", 467 | " (v): Linear(in_features=768, out_features=128, bias=True)\n", 468 | " )\n", 469 | " )\n", 470 | " (out_proj): Linear(in_features=768, out_features=768, bias=True)\n", 471 | ")" 472 | ] 473 | }, 474 | "execution_count": 12, 475 | "metadata": {}, 476 | "output_type": "execute_result" 477 | } 478 | ], 479 | "source": [ 480 | "mha" 481 | ] 482 | }, 483 | { 484 | "cell_type": "code", 485 | "execution_count": null, 486 | "metadata": { 487 | "id": "rQfC90HwjtgN" 488 | }, 489 | "outputs": [], 490 | "source": [ 491 | "class LLaMAMLP(nn.Module):\n", 492 | " def __init__(self, hidden_dim, intermediate_dim): # in MLP: intermediate_dim= 4 * 
hidden_dim\n", 493 | " super(LLaMAMLP, self).__init__()\n", 494 | " self.linear_1 = nn.Linear(hidden_dim, intermediate_dim)\n", 495 | " self.linear_2 = nn.Linear(hidden_dim, intermediate_dim) # Original: intermediate -> hidden.\n", 496 | " self.activation_fn = nn.SiLU()\n", 497 | " self.out_proj = nn.Linear(intermediate_dim, hidden_dim) # Original: dropout\n", 498 | "\n", 499 | "\n", 500 | " def forward(self, hidden_state):\n", 501 | " x_fc_1 = self.linear_1(hidden_state)\n", 502 | " x_fc_2 = self.linear_2(hidden_state)\n", 503 | " x = self.activation_fn(x_fc_1) * x_fc_2\n", 504 | " return self.out_proj(x)" 505 | ] 506 | }, 507 | { 508 | "cell_type": "code", 509 | "execution_count": null, 510 | "metadata": { 511 | "colab": { 512 | "base_uri": "https://localhost:8080/" 513 | }, 514 | "id": "HifM3AeAq44l", 515 | "outputId": "f562f104-fe04-4045-d96f-bab97c8fb4f9" 516 | }, 517 | "outputs": [ 518 | { 519 | "data": { 520 | "text/plain": [ 521 | "torch.Size([5, 10, 768])" 522 | ] 523 | }, 524 | "execution_count": 51, 525 | "metadata": {}, 526 | "output_type": "execute_result" 527 | } 528 | ], 529 | "source": [ 530 | "mlp = LLaMAMLP(d_model, intermediate_dim)\n", 531 | "mlp(input_data).shape" 532 | ] 533 | }, 534 | { 535 | "cell_type": "code", 536 | "execution_count": null, 537 | "metadata": { 538 | "id": "DDnXHH6bP4a3" 539 | }, 540 | "outputs": [], 541 | "source": [ 542 | "class Block(nn.Module):\n", 543 | " def __init__(self, n_heads, n_k_v_heads, hidden_dim, intermediate_dim):\n", 544 | " super(Block, self).__init__()\n", 545 | " self.n_heads = n_heads\n", 546 | " self.hidden_dim = hidden_dim\n", 547 | " self.intermediate_dim = intermediate_dim\n", 548 | "\n", 549 | " # Self-Attention (MHA, MQA & GQA)\n", 550 | " if n_heads == n_k_v_heads:\n", 551 | " # MHA selected\n", 552 | " self.attn = MultiHeadAttention(n_heads, hidden_dim=hidden_dim)\n", 553 | " elif n_k_v_heads == 1 :\n", 554 | " # MQA selected\n", 555 | " self.attn = MultiQueryAttention(n_heads, 
hidden_dim=hidden_dim)\n", 556 | " elif n_heads // n_k_v_heads > 1:\n", 557 | " # GQA selected\n", 558 | " self.attn = GroupedQueryAttention(n_heads // n_k_v_heads, n_k_v_heads, hidden_dim=hidden_dim)\n", 559 | " else:\n", 560 | " # MHA selected\n", 561 | " self.attn = MultiHeadAttention(n_heads, hidden_dim=hidden_dim)\n", 562 | "\n", 563 | " self.layer_norm = nn.LayerNorm(hidden_dim)\n", 564 | " self.mlp = LLaMAMLP(hidden_dim, intermediate_dim)\n", 565 | "\n", 566 | " def forward(self, hidden_state, mask=None):\n", 567 | " x = self.attn(hidden_state)\n", 568 | " x = self.layer_norm(hidden_state) + x\n", 569 | " x_fc = self.mlp(x)\n", 570 | " x += x_fc\n", 571 | " return x\n" 572 | ] 573 | }, 574 | { 575 | "cell_type": "code", 576 | "execution_count": null, 577 | "metadata": { 578 | "colab": { 579 | "base_uri": "https://localhost:8080/" 580 | }, 581 | "id": "S5UplSTmSDpt", 582 | "outputId": "cb272aad-0958-4267-9266-f0a5ee4eb557" 583 | }, 584 | "outputs": [ 585 | { 586 | "name": "stdout", 587 | "output_type": "stream", 588 | "text": [ 589 | "GQA selected\n" 590 | ] 591 | }, 592 | { 593 | "data": { 594 | "text/plain": [ 595 | "torch.Size([5, 10, 768])" 596 | ] 597 | }, 598 | "execution_count": 112, 599 | "metadata": {}, 600 | "output_type": "execute_result" 601 | } 602 | ], 603 | "source": [ 604 | "block = Block(n_heads, n_k_v_heads, d_model, intermediate_dim)\n", 605 | "block(input_data).shape" 606 | ] 607 | }, 608 | { 609 | "cell_type": "code", 610 | "execution_count": null, 611 | "metadata": { 612 | "id": "GcngjpHCSUnh" 613 | }, 614 | "outputs": [], 615 | "source": [ 616 | "class babyLLaMA(nn.Module):\n", 617 | " def __init__(self, max_seq_len, vocab_size, n_layers, n_heads, n_k_v_heads, hidden_dim, intermediate_dim):\n", 618 | " super(babyLLaMA, self).__init__()\n", 619 | " self.emb = nn.Embedding(vocab_size, hidden_dim)\n", 620 | " self.pos = nn.Embedding(max_seq_len, hidden_dim)\n", 621 | " self.blocks = nn.ModuleList(\n", 622 | " [Block(n_heads, n_k_v_heads, 
hidden_dim, intermediate_dim) for _ in range(n_layers)]\n", 623 | " )\n", 624 | " self.out_proj = nn.Linear(hidden_dim, vocab_size)\n", 625 | "\n", 626 | " def forward(self, hidden_state):\n", 627 | " emb = self.emb(hidden_state)\n", 628 | " seq_len = hidden_state.size(1)\n", 629 | " positions = torch.arange(seq_len, dtype=torch.long, device=hidden_state.device).unsqueeze(0) # same device as the input ids\n", 630 | " pos = self.pos(positions)\n", 631 | " x = emb + pos\n", 632 | "\n", 633 | " for b in self.blocks:\n", 634 | " x = b(x)\n", 635 | "\n", 636 | " x = self.out_proj(x)\n", 637 | " return F.softmax(x, dim=-1)\n" 638 | ] 639 | }, 640 | { 641 | "cell_type": "code", 642 | "execution_count": null, 643 | "metadata": { 644 | "id": "HZqvBWtmVxUk" 645 | }, 646 | "outputs": [], 647 | "source": [ 648 | "llm = babyLLaMA(d_model, 32000, 22, n_heads, n_k_v_heads, d_model, intermediate_dim)\n", 649 | "input_ids = torch.randint(1, 32000, (batch_size, sequence_length))" 650 | ] 651 | }, 652 | { 653 | "cell_type": "code", 654 | "execution_count": null, 655 | "metadata": { 656 | "colab": { 657 | "base_uri": "https://localhost:8080/" 658 | }, 659 | "id": "Lk9xlnKeYNuY", 660 | "outputId": "c380de0c-0737-4537-a586-22a2d757a1d0" 661 | }, 662 | "outputs": [ 663 | { 664 | "data": { 665 | "text/plain": [ 666 | "torch.Size([5, 10, 32000])" 667 | ] 668 | }, 669 | "execution_count": 69, 670 | "metadata": {}, 671 | "output_type": "execute_result" 672 | } 673 | ], 674 | "source": [ 675 | "llm(input_ids).shape" 676 | ] 677 | }, 678 | { 679 | "cell_type": "code", 680 | "execution_count": null, 681 | "metadata": { 682 | "colab": { 683 | "base_uri": "https://localhost:8080/" 684 | }, 685 | "id": "geZxTDtrYica", 686 | "outputId": "13f3cc25-5ed2-453c-a787-c6e5e8a3d3f8" 687 | }, 688 | "outputs": [ 689 | { 690 | "data": { 691 | "text/plain": [ 692 | "babyLLaMA(\n", 693 | " (emb): Embedding(32000, 768)\n", 694 | " (pos): Embedding(768, 768)\n", 695 | " (blocks): ModuleList(\n", 696 | " (0-21): 22 x Block(\n", 697 | " (mqa): MultiQueryAttention(\n", 698 
| " (queries): ModuleList(\n", 699 | " (0-5): 6 x Linear(in_features=768, out_features=128, bias=True)\n", 700 | " )\n", 701 | " (key): Linear(in_features=768, out_features=128, bias=True)\n", 702 | " (value): Linear(in_features=768, out_features=128, bias=True)\n", 703 | " (out_proj): Linear(in_features=768, out_features=768, bias=True)\n", 704 | " )\n", 705 | " (layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)\n", 706 | " (mlp): LLaMAMLP(\n", 707 | " (linear_1): Linear(in_features=768, out_features=3072, bias=True)\n", 708 | " (linear_2): Linear(in_features=768, out_features=3072, bias=True)\n", 709 | " (activation_fn): SiLU()\n", 710 | " (out_proj): Linear(in_features=3072, out_features=768, bias=True)\n", 711 | " )\n", 712 | " )\n", 713 | " )\n", 714 | " (out_proj): Linear(in_features=768, out_features=32000, bias=True)\n", 715 | ")" 716 | ] 717 | }, 718 | "execution_count": 75, 719 | "metadata": {}, 720 | "output_type": "execute_result" 721 | } 722 | ], 723 | "source": [ 724 | "llm" 725 | ] 726 | }, 727 | { 728 | "cell_type": "code", 729 | "execution_count": null, 730 | "metadata": { 731 | "colab": { 732 | "base_uri": "https://localhost:8080/" 733 | }, 734 | "id": "FUnuoPINY9VS", 735 | "outputId": "09a464ce-0c95-4bdf-a3e3-561e51fa27a6" 736 | }, 737 | "outputs": [ 738 | { 739 | "data": { 740 | "text/plain": [ 741 | "266290432" 742 | ] 743 | }, 744 | "execution_count": 127, 745 | "metadata": {}, 746 | "output_type": "execute_result" 747 | } 748 | ], 749 | "source": [ 750 | "count_parameters(llm)" 751 | ] 752 | } 753 | ], 754 | "metadata": { 755 | "colab": { 756 | "provenance": [] 757 | }, 758 | "kernelspec": { 759 | "display_name": "Python 3", 760 | "name": "python3" 761 | }, 762 | "language_info": { 763 | "name": "python" 764 | } 765 | }, 766 | "nbformat": 4, 767 | "nbformat_minor": 0 768 | } 769 | -------------------------------------------------------------------------------- /Llama-3/Part 1/Downcycling.ipynb: 
-------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "! pip install transformers torch accelerate huggingface-hub[cli] hf-transfer" 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": null, 15 | "metadata": {}, 16 | "outputs": [], 17 | "source": [ 18 | "def count_parameters(model):\n", 19 | " # Calculate the number of parameters in billions\n", 20 | " num_params = sum(p.numel() for p in model.parameters() if p.requires_grad) / 10**9\n", 21 | " print(f\"Model size: {num_params:.3f}B parameters\")\n", 22 | " return int(num_params)\n" 23 | ] 24 | }, 25 | { 26 | "cell_type": "markdown", 27 | "metadata": {}, 28 | "source": [ 29 | "## Load Reference Model" 30 | ] 31 | }, 32 | { 33 | "cell_type": "code", 34 | "execution_count": null, 35 | "metadata": {}, 36 | "outputs": [], 37 | "source": [ 38 | "from transformers import AutoModelForCausalLM, AutoConfig, AutoTokenizer\n", 39 | "import os\n", 40 | "\n", 41 | "os.environ[\"HF_HUB_ENABLE_HF_TRANSFER\"] = \"1\"\n", 42 | "\n", 43 | "# Load meta-llama/Meta-Llama-3-8B model, config and tokenizer\n", 44 | "model_name = \"meta-llama/Meta-Llama-3-8B\"\n", 45 | "model = AutoModelForCausalLM.from_pretrained(model_name)\n", 46 | "config = AutoConfig.from_pretrained(model_name)\n", 47 | "tokenizer = AutoTokenizer.from_pretrained(model_name)" 48 | ] 49 | }, 50 | { 51 | "cell_type": "code", 52 | "execution_count": null, 53 | "metadata": {}, 54 | "outputs": [], 55 | "source": [ 56 | "count_parameters(model)" 57 | ] 58 | }, 59 | { 60 | "cell_type": "code", 61 | "execution_count": null, 62 | "metadata": {}, 63 | "outputs": [], 64 | "source": [ 65 | "model" 66 | ] 67 | }, 68 | { 69 | "cell_type": "code", 70 | "execution_count": null, 71 | "metadata": {}, 72 | "outputs": [], 73 | "source": [ 74 | "def extract_model_weights(reference_model, 
n_layers):\n", 75 | " params = {}\n", 76 | " current_layer = 0 # To keep track of the main layer count\n", 77 | "\n", 78 | " # Iterate over all named modules\n", 79 | " for name, module in reference_model.named_modules():\n", 80 | "\n", 81 | " # Check and store parameters\n", 82 | " if hasattr(module, 'weight') and module.weight is not None:\n", 83 | " params[name + '.weight'] = module.weight.data.clone()\n", 84 | " if hasattr(module, 'bias') and module.bias is not None:\n", 85 | " params[name + '.bias'] = module.bias.data.clone()\n", 86 | "\n", 87 | " if 'model.layers.' in name:\n", 88 | " # Check the layer index\n", 89 | " layer_index = int(name.split('.')[2]) # This splits the name and gets the third element\n", 90 | " if layer_index > current_layer:\n", 91 | " current_layer = layer_index\n", 92 | " if current_layer > n_layers-1:\n", 93 | " break # Stop once the first n_layers decoder layers have been collected\n", 94 | "\n", 95 | " norm_layer = reference_model.model.norm # Adjust this path based on your model's architecture\n", 96 | " if hasattr(norm_layer, 'weight') and norm_layer.weight is not None:\n", 97 | " params['model.norm.weight'] = norm_layer.weight.data.clone()\n", 98 | " if hasattr(norm_layer, 'bias') and norm_layer.bias is not None:\n", 99 | " params['model.norm.bias'] = norm_layer.bias.data.clone()\n", 100 | "\n", 101 | " lm_head = reference_model.lm_head\n", 102 | " if hasattr(lm_head, 'weight') and lm_head.weight is not None:\n", 103 | " params[\"lm_head.weight\"] = lm_head.weight.data.clone()\n", 104 | " if hasattr(lm_head, 'bias') and lm_head.bias is not None:\n", 105 | " params[\"lm_head.bias\"] = lm_head.bias.data.clone()\n", 106 | "\n", 107 | " return params\n" 108 | ] 109 | }, 110 | { 111 | "cell_type": "code", 112 | "execution_count": null, 113 | "metadata": {}, 114 | "outputs": [], 115 | "source": [ 116 | "target_model_n_layers = 24\n", 117 | "pretrained_weights = extract_model_weights(model, target_model_n_layers)" 118 | ] 119 | }, 120 | { 121 | "cell_type": "code", 122 | 
"execution_count": null, 123 | "metadata": {}, 124 | "outputs": [], 125 | "source": [ 126 | "from transformers import AutoModelForCausalLM, AutoConfig\n", 127 | "config = AutoConfig.from_pretrained(model_name)\n", 128 | "config.num_hidden_layers = target_model_n_layers\n", 129 | "target_model = AutoModelForCausalLM.from_config(config)\n" 130 | ] 131 | }, 132 | { 133 | "cell_type": "code", 134 | "execution_count": null, 135 | "metadata": {}, 136 | "outputs": [], 137 | "source": [ 138 | "target_model_size = count_parameters(target_model)" 139 | ] 140 | }, 141 | { 142 | "cell_type": "code", 143 | "execution_count": null, 144 | "metadata": {}, 145 | "outputs": [], 146 | "source": [ 147 | "target_model.load_state_dict(pretrained_weights)\n" 148 | ] 149 | }, 150 | { 151 | "cell_type": "code", 152 | "execution_count": null, 153 | "metadata": {}, 154 | "outputs": [], 155 | "source": [ 156 | "inputs = tokenizer(\n", 157 | "[\n", 158 | " \"Who created Python?\"\n", 159 | "], return_tensors = \"pt\")\n", 160 | "\n", 161 | "# inputs = tokenizer.apply_chat_template(\n", 162 | "# [\n", 163 | "# # {\"content\":\"\",\"role\":\"system\"},\n", 164 | "# {\"content\":\"\"\"Given the question: Read the article and select the best\n", 165 | "# answer. Article: Can you swim? Do you like swimming? Well, how can you\n", 166 | "# learn to swim? I think the best way is to go into the water and learn.\n", 167 | "# I'm afraid you'll never learn to swim just by reading books about\n", 168 | "# Swimming or looking at others swimming. It's the same with the English\n", 169 | "# study. We must practice, practice and practice. Listening and speaking\n", 170 | "# are very important for beginners. We can listen to English programs on radio.\n", 171 | "# You may just understand a few words. It doesn't matter. Just be relaxed,\n", 172 | "# try to catch every word. Somebody may be a good listener, but he is afraid\n", 173 | "# to speak because he's afraid of making mistakes. 
You know we sometimes\n", 174 | "# make mistakes when we speak Chinese. Don't be afraid. We must be brave.\n", 175 | "# If you really want to learn English well, you must try to speak with\n", 176 | "# everyone as long as he knows English. When there's nobody to talk with,\n", 177 | "# you can talk to yourself in English. It's interesting and also a good\n", 178 | "# way to practice your spoken English. Remember, the more you speak, the\n", 179 | "# fewer mistakes you'll make. Reading and writing are more important for\n", 180 | "# senior school students. First we must choose the books we're interested\n", 181 | "# in. A lot of reading will improve your language sense.\n", 182 | "# This is very important. It's easier said than done. Well, let's do\n", 183 | "# more practice from now on. I'm sure you'll learn English well in this\n", 184 | "# way. ,A, B, C, D,. (10)\n", 185 | "# Question: Which is the best title for the passage?\n", 186 | "# Options:\n", 187 | "# A: How to Learn English.\n", 188 | "# B: Easier Said Than Done.\n", 189 | "# C: Listen First, Speak Second.\n", 190 | "# D: How to learn to Swim.\\n\n", 191 | "# The answer is:\"\"\",\"role\":\"user\"}\n", 192 | "# ], add_generation_prompt=True, return_tensors='pt',\n", 193 | "# )" 194 | ] 195 | }, 196 | { 197 | "cell_type": "code", 198 | "execution_count": null, 199 | "metadata": {}, 200 | "outputs": [], 201 | "source": [ 202 | "from transformers import TextStreamer\n", 203 | "text_streamer = TextStreamer(tokenizer)\n", 204 | "_ = target_model.generate(**inputs, streamer = text_streamer, max_new_tokens = 200)" 205 | ] 206 | }, 207 | { 208 | "cell_type": "code", 209 | "execution_count": null, 210 | "metadata": {}, 211 | "outputs": [], 212 | "source": [ 213 | "target_model.push_to_hub(\"Llama-3-6B-Instruct-v0.1\")\n", 214 | "tokenizer.push_to_hub(\"Llama-3-6B-Instruct-v0.1\")" 215 | ] 216 | }, 217 | { 218 | "cell_type": "markdown", 219 | "metadata": {}, 220 | "source": [ 221 | "# Downcycling by getting the 
first X layers and last X layers \n", 222 | "\n", 223 | "Where X is N/2 and N is the target number of layers.\n", 224 | "\n", 225 | "For instance, if our target number of layers is 24, then X will be 24/2 = 12." 226 | ] 227 | }, 228 | { 229 | "cell_type": "code", 230 | "execution_count": null, 231 | "metadata": {}, 232 | "outputs": [], 233 | "source": [ 234 | "target_model_n_layers = 24\n", 235 | "weights_1 = model.model.layers[:target_model_n_layers//2]\n", 236 | "weights_2 = model.model.layers[-target_model_n_layers//2:]\n", 237 | "\n", 238 | "# 'model' is the reference model loaded above.\n", 239 | "# The target model keeps 24 layers: the first 12 and the last 12 of the reference model.\n", 240 | "\n", 241 | "# Extract weights for the first 12 layers\n", 242 | "weights_1 = {f'model.layers.{k}': v.clone() for k, v in weights_1.state_dict().items() }\n", 243 | "\n", 244 | "# Extract weights for the last 12 layers\n", 245 | "weights_2 = {f'model.layers.{k}': v.clone() for k, v in weights_2.state_dict().items()}\n" 246 | ] 247 | }, 248 | { 249 | "cell_type": "code", 250 | "execution_count": null, 251 | "metadata": {}, 252 | "outputs": [], 253 | "source": [ 254 | "# Get remainder modules weights\n", 255 | "weights_1[\"model.embed_tokens.weight\"] = model.model.state_dict()['embed_tokens.weight']\n", 256 | "weights_2[\"model.norm.weight\"] = model.model.state_dict()['norm.weight']\n", 257 | "weights_2[\"lm_head.weight\"] = model.state_dict()['lm_head.weight']" 258 | ] 259 | }, 260 | { 261 | "cell_type": "code", 262 | "execution_count": null, 263 | "metadata": {}, 264 | "outputs": [], 265 | "source": [ 266 | "import re\n", 267 | "def update_layer_numbers(state_dict, x_size):\n", 268 | " new_state_dict = {}\n", 269 | " # Regular expression to find and manipulate the layer numbers\n", 270 | " pattern = re.compile(r'model\\.layers\\.(\\d+)')\n", 271 | "\n", 272 | " for key, value in state_dict.items():\n", 273 | " # Shift each matched layer index by x_size; other keys pass through unchanged\n", 274 | " new_key = pattern.sub(lambda x: 
f\"model.layers.{int(x.group(1)) + x_size}\", key)\n", 275 | " new_state_dict[new_key] = value\n", 276 | "\n", 277 | " return new_state_dict\n", 278 | "\n", 279 | "\n", 280 | "weights_2 = update_layer_numbers(weights_2, target_model_n_layers//2)" 281 | ] 282 | }, 283 | { 284 | "cell_type": "code", 285 | "execution_count": null, 286 | "metadata": {}, 287 | "outputs": [], 288 | "source": [ 289 | "from transformers import AutoModelForCausalLM, AutoConfig\n", 290 | "config = AutoConfig.from_pretrained(model_name)\n", 291 | "config.num_hidden_layers = target_model_n_layers\n", 292 | "target_model = AutoModelForCausalLM.from_config(config)" 293 | ] 294 | }, 295 | { 296 | "cell_type": "code", 297 | "execution_count": null, 298 | "metadata": {}, 299 | "outputs": [], 300 | "source": [ 301 | "target_model.load_state_dict({**weights_1, **weights_2})" 302 | ] 303 | }, 304 | { 305 | "cell_type": "code", 306 | "execution_count": null, 307 | "metadata": {}, 308 | "outputs": [], 309 | "source": [ 310 | "count_parameters(target_model)" 311 | ] 312 | }, 313 | { 314 | "cell_type": "code", 315 | "execution_count": null, 316 | "metadata": {}, 317 | "outputs": [], 318 | "source": [ 319 | "inputs = tokenizer.apply_chat_template(\n", 320 | " [\n", 321 | " # {\"content\":\"\",\"role\":\"system\"},\n", 322 | " {\"content\":\"\"\"Given the question: Read the article and select the best\n", 323 | " answer. Article: Can you swim? Do you like swimming? Well, how can you\n", 324 | " learn to swim? I think the best way is to go into the water and learn.\n", 325 | " I'm afraid you'll never learn to swim just by reading books about\n", 326 | " Swimming or looking at others swimming. It's the same with the English\n", 327 | " study. We must practice, practice and practice. Listening and speaking\n", 328 | " are very important for beginners. We can listen to English programs on radio.\n", 329 | " You may just understand a few words. It doesn't matter. 
Just be relaxed,\n", 330 | " try to catch every word. Somebody may be a good listener, but he is afraid\n", 331 | " to speak because he's afraid of making mistakes. You know we sometimes\n", 332 | " make mistakes when we speak Chinese. Don't be afraid. We must be brave.\n", 333 | " If you really want to learn English well, you must try to speak with\n", 334 | " everyone as long as he knows English. When there's nobody to talk with,\n", 335 | " you can talk to yourself in English. It's interesting and also a good\n", 336 | " way to practice your spoken English. Remember, the more you speak, the\n", 337 | " fewer mistakes you'll make. Reading and writing are more important for\n", 338 | " senior school students. First we must choose the books we're interested\n", 339 | " in. A lot of reading will improve your language sense.\n", 340 | " This is very important. It's easier said than done. Well, let's do\n", 341 | " more practice from now on. I'm sure you'll learn English well in this\n", 342 | " way. ,A, B, C, D,. 
(10)\n", 343 | " Question: Which is the best title for the passage?\n", 344 | " Options:\n", 345 | " A: How to Learn English.\n", 346 | " B: Easier Said Than Done.\n", 347 | " C: Listen First, Speak Second.\n", 348 | " D: How to learn to Swim.\\n\n", 349 | " The answer is:\"\"\",\"role\":\"user\"}\n", 350 | " ], add_generation_prompt=True, return_tensors='pt',\n", 351 | ")" 352 | ] 353 | }, 354 | { 355 | "cell_type": "code", 356 | "execution_count": null, 357 | "metadata": {}, 358 | "outputs": [], 359 | "source": [ 360 | "from transformers import TextStreamer\n", 361 | "text_streamer = TextStreamer(tokenizer)\n", 362 | "_ = target_model.generate(inputs, streamer = text_streamer, max_new_tokens = 128)" 363 | ] 364 | }, 365 | { 366 | "cell_type": "code", 367 | "execution_count": null, 368 | "metadata": {}, 369 | "outputs": [], 370 | "source": [ 371 | "target_model.push_to_hub(\"Llama-3-6B-Instruct-Granite-v0.1\")\n", 372 | "tokenizer.push_to_hub(\"Llama-3-6B-Instruct-Granite-v0.1\")" 373 | ] 374 | } 375 | ], 376 | "metadata": { 377 | "kernelspec": { 378 | "display_name": "mlx_code", 379 | "language": "python", 380 | "name": "python3" 381 | }, 382 | "language_info": { 383 | "codemirror_mode": { 384 | "name": "ipython", 385 | "version": 3 386 | }, 387 | "file_extension": ".py", 388 | "mimetype": "text/x-python", 389 | "name": "python", 390 | "nbconvert_exporter": "python", 391 | "pygments_lexer": "ipython3", 392 | "version": "3.10.14" 393 | } 394 | }, 395 | "nbformat": 4, 396 | "nbformat_minor": 2 397 | } 398 | -------------------------------------------------------------------------------- /Llama-3/Part 2/Config/train-llama-3-6B.yml: -------------------------------------------------------------------------------- 1 | base_model: prince-canuma/Llama-3-6B-v0 2 | model_type: AutoModelForCausalLM 3 | tokenizer_type: AutoTokenizer 4 | 5 | load_in_8bit: false 6 | load_in_4bit: true 7 | strict: false 8 | 9 | datasets: 10 | - path: prince-canuma/fineweb-CC-MAIN-2024-10-1B-en 
11 | type: completion 12 | split: train 13 | dataset_prepared_path: last_run_prepared 14 | val_set_size: 0.001 15 | output_dir: ./llama-3-6b 16 | save_safetensors: true 17 | adapter: qlora 18 | lora_model_dir: 19 | 20 | sequence_len: 8192 21 | sample_packing: false 22 | pad_to_sequence_len: false 23 | 24 | lora_r: 128 25 | lora_alpha: 128 26 | lora_dropout: 0.05 27 | lora_target_modules: 28 | lora_target_linear: true 29 | lora_fan_in_fan_out: 30 | 31 | 32 | wandb_project: llama-3-6b 33 | wandb_entity: 34 | wandb_watch: 35 | wandb_name: 36 | wandb_log_model: 37 | 38 | gradient_accumulation_steps: 8 39 | micro_batch_size: 2 40 | num_epochs: 2 41 | optimizer: paged_adamw_32bit 42 | lr_scheduler: cosine 43 | learning_rate: 2e-4 44 | 45 | train_on_inputs: false 46 | group_by_length: false 47 | bf16: auto 48 | fp16: 49 | tf32: false 50 | 51 | gradient_checkpointing: true 52 | early_stopping_patience: 53 | resume_from_checkpoint: 54 | local_rank: 55 | logging_steps: 1 56 | xformers_attention: 57 | flash_attention: true 58 | 59 | warmup_steps: 100 60 | evals_per_epoch: 4 61 | eval_table_size: 62 | save_steps: 4000 63 | debug: 64 | deepspeed: 65 | weight_decay: 0.0 66 | fsdp: 67 | fsdp_config: 68 | special_tokens: 69 | pad_token: "<|reserved_special_token_0|>" 70 | 71 | -------------------------------------------------------------------------------- /Llama-3/Part 2/Downcycling_Comparision.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": {}, 7 | "outputs": [], 8 | "source": [ 9 | "! 
pip install transformers torch accelerate huggingface-hub[cli] hf-transfer" 10 | ] 11 | }, 12 | { 13 | "cell_type": "code", 14 | "execution_count": null, 15 | "metadata": {}, 16 | "outputs": [], 17 | "source": [ 18 | "from transformers import TextStreamer\n", 19 | "\n", 20 | "def count_parameters(model):\n", 21 | " # Calculate the number of parameters in billions\n", 22 | " num_params = sum(p.numel() for p in model.parameters() if p.requires_grad) / 10**9\n", 23 | " print(f\"Model size: {num_params:.3f}B parameters\")\n", 24 | " return int(num_params)\n", 25 | "\n", 26 | "def generate(model, tokenizer, inputs, max_new_tokens=50):\n", 27 | " text_streamer = TextStreamer(tokenizer)\n", 28 | " _ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = max_new_tokens)" 29 | ] 30 | }, 31 | { 32 | "cell_type": "markdown", 33 | "metadata": {}, 34 | "source": [ 35 | "" 36 | ] 37 | }, 38 | { 39 | "cell_type": "markdown", 40 | "metadata": {}, 41 | "source": [ 42 | "## Load Untrained Downcycled Model" 43 | ] 44 | }, 45 | { 46 | "cell_type": "code", 47 | "execution_count": null, 48 | "metadata": {}, 49 | "outputs": [], 50 | "source": [ 51 | "from transformers import AutoModelForCausalLM, AutoTokenizer\n", 52 | "import os\n", 53 | "\n", 54 | "os.environ[\"HF_HUB_ENABLE_HF_TRANSFER\"] = \"1\"\n", 55 | "\n", 56 | "# Load model and tokenizer\n", 57 | "model_name = \"prince-canuma/Llama-3-6B-v0\"\n", 58 | "untrained_model = AutoModelForCausalLM.from_pretrained(model_name)\n", 59 | "tokenizer = AutoTokenizer.from_pretrained(model_name)" 60 | ] 61 | }, 62 | { 63 | "cell_type": "markdown", 64 | "metadata": {}, 65 | "source": [ 66 | "## Load Pretrained Downcycled Model" 67 | ] 68 | }, 69 | { 70 | "cell_type": "code", 71 | "execution_count": null, 72 | "metadata": {}, 73 | "outputs": [], 74 | "source": [ 75 | "from transformers import AutoModelForCausalLM, AutoTokenizer\n", 76 | "\n", 77 | "\n", 78 | "# Load model and tokenizer\n", 79 | 
"model_name = \"prince-canuma/Llama-3-6B-v0.1\"\n", 80 | "model = AutoModelForCausalLM.from_pretrained(model_name)\n", 81 | "tokenizer = AutoTokenizer.from_pretrained(model_name)" 82 | ] 83 | }, 84 | { 85 | "cell_type": "code", 86 | "execution_count": null, 87 | "metadata": {}, 88 | "outputs": [], 89 | "source": [ 90 | "count_parameters(model)" 91 | ] 92 | }, 93 | { 94 | "cell_type": "code", 95 | "execution_count": null, 96 | "metadata": {}, 97 | "outputs": [], 98 | "source": [ 99 | "inputs = tokenizer(\n", 100 | "[\n", 101 | " \"The Eiffel Tower is located in\"\n", 102 | "], return_tensors = \"pt\")\n" 103 | ] 104 | }, 105 | { 106 | "cell_type": "code", 107 | "execution_count": null, 108 | "metadata": {}, 109 | "outputs": [], 110 | "source": [ 111 | "generate(untrained_model, tokenizer, inputs)" 112 | ] 113 | }, 114 | { 115 | "cell_type": "code", 116 | "execution_count": null, 117 | "metadata": {}, 118 | "outputs": [], 119 | "source": [ 120 | "generate(model, tokenizer, inputs)" 121 | ] 122 | } 123 | ], 124 | "metadata": { 125 | "kernelspec": { 126 | "display_name": "mlx_code", 127 | "language": "python", 128 | "name": "python3" 129 | }, 130 | "language_info": { 131 | "codemirror_mode": { 132 | "name": "ipython", 133 | "version": 3 134 | }, 135 | "file_extension": ".py", 136 | "mimetype": "text/x-python", 137 | "name": "python", 138 | "nbconvert_exporter": "python", 139 | "pygments_lexer": "ipython3", 140 | "version": "3.10.14" 141 | } 142 | }, 143 | "nbformat": 4, 144 | "nbformat_minor": 2 145 | } 146 | -------------------------------------------------------------------------------- /Llama-3/Part 2/FineWeb10B.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "cells": [ 3 | { 4 | "cell_type": "code", 5 | "execution_count": null, 6 | "metadata": { 7 | "colab": { 8 | "base_uri": "https://localhost:8080/" 9 | }, 10 | "id": "KDeqo4iUPPK_", 11 | "outputId": "0726a097-ea22-481d-c5ce-7f330576053e" 12 | }, 13 | "outputs": [], 
14 | "source": [ 15 | "!pip install datasets huggingface-hub[cli] hf-transfer pyarrow" 16 | ] 17 | }, 18 | { 19 | "cell_type": "code", 20 | "execution_count": null, 21 | "metadata": { 22 | "colab": { 23 | "base_uri": "https://localhost:8080/", 24 | "height": 205, 25 | "referenced_widgets": [ 26 | "3b6f554968d54e70a8e9ece4763fa612", 27 | "22b34f9b327746dd93ebe6538482ffdc", 28 | "0b69c40ac70542d7809ae6240c6ea8d3", 29 | "2921d2c9d1a7426bab54397a71772757", 30 | "a1e41ea2a9dd4138bcba2643d472ec78", 31 | "31ae5b1ddf8c405d9b97c901043365d7", 32 | "2e15dd36207146789d737ad4077d368b", 33 | "e4457ef8ca424b74a546e2fa85cdf69f", 34 | "53c8d989d57443ce8cd5861721bbdb41", 35 | "0eb083ee46f743d4b429bf6612da55a6", 36 | "048a01b6f7034148b5fa53d0e33088b0", 37 | "b9da0257d25e47d087dda0644be79d0f", 38 | "952a134784564a40a26f60fdc24a4b29", 39 | "c86d3e5c71804299bd3626eda5ad6bec", 40 | "3d78d66f951c48969c2036764cd415d0", 41 | "ed5c1bfee6084b40953064d2cf66ebdd", 42 | "fd32ad532d594347849bf1aa89fa0b15", 43 | "7c5d9781693245f8bf888e8f909f097f", 44 | "4289ac665bd3494b8091af46aa03d27b", 45 | "05eafd28789747c5b8541e81d392aef4", 46 | "bc96b42534554357872d4497616054c4", 47 | "6a59d212d5184dc39ce77c279ebca348" 48 | ] 49 | }, 50 | "id": "jh97g8-gPe3H", 51 | "outputId": "e0c87cbd-939e-4faf-ab9d-87993f1a80ae" 52 | }, 53 | "outputs": [], 54 | "source": [ 55 | "from datasets import load_dataset\n", 56 | "# use name=\"sample-10BT\" to use the 10BT sample\n", 57 | "fw = load_dataset(\"HuggingFaceFW/fineweb\", name=\"CC-MAIN-2024-10\", split=\"train\", streaming=True)\n" 58 | ] 59 | }, 60 | { 61 | "cell_type": "code", 62 | "execution_count": null, 63 | "metadata": { 64 | "colab": { 65 | "base_uri": "https://localhost:8080/" 66 | }, 67 | "id": "yPWI91xuQGaV", 68 | "outputId": "00c4901f-dd90-411f-da71-8a831ca4241d" 69 | }, 70 | "outputs": [], 71 | "source": [ 72 | "fw" 73 | ] 74 | }, 75 | { 76 | "cell_type": "code", 77 | "execution_count": null, 78 | "metadata": { 79 | "id": "oDnktugVQHru" 80 | }, 81 | 
"outputs": [], 82 | "source": [ 83 | "filtered_dataset = fw.filter(lambda example: example['language'] == 'en')" 84 | ] 85 | }, 86 | { 87 | "cell_type": "code", 88 | "execution_count": null, 89 | "metadata": { 90 | "id": "Y9StbBlrQhgf" 91 | }, 92 | "outputs": [], 93 | "source": [ 94 | "from tqdm import tqdm\n", 95 | "\n", 96 | "# Wrapping the 'take' method call with tqdm to display a progress bar\n", 97 | "dataset_10B = [item for item in tqdm(filtered_dataset.take(15000000), total=15000000)]\n" 98 | ] 99 | }, 100 | { 101 | "cell_type": "code", 102 | "execution_count": null, 103 | "metadata": { 104 | "id": "sSSY3K9kSwW5" 105 | }, 106 | "outputs": [], 107 | "source": [ 108 | "import pandas as pd\n", 109 | "dataset = pd.DataFrame(dataset_10B, index=None)" 110 | ] 111 | }, 112 | { 113 | "cell_type": "code", 114 | "execution_count": null, 115 | "metadata": { 116 | "colab": { 117 | "base_uri": "https://localhost:8080/", 118 | "height": 1000 119 | }, 120 | "id": "EXuxcMFBUYjQ", 121 | "outputId": "0881e898-89c2-4ebc-a3e4-7d479db33a77" 122 | }, 123 | "outputs": [], 124 | "source": [ 125 | "dataset" 126 | ] 127 | }, 128 | { 129 | "cell_type": "code", 130 | "execution_count": null, 131 | "metadata": { 132 | "colab": { 133 | "base_uri": "https://localhost:8080/" 134 | }, 135 | "id": "Ex1lKDNIVlTj", 136 | "outputId": "9789ff15-9a27-4bd8-b307-34039e3a2983" 137 | }, 138 | "outputs": [], 139 | "source": [ 140 | "dataset[\"token_count\"].sum()" 141 | ] 142 | }, 143 | { 144 | "cell_type": "code", 145 | "execution_count": null, 146 | "metadata": {}, 147 | "outputs": [], 148 | "source": [ 149 | "dataset.to_csv('fineweb-CC-MAIN-2024-10-8B-en.csv', index=False)" 150 | ] 151 | }, 152 | { 153 | "cell_type": "code", 154 | "execution_count": null, 155 | "metadata": {}, 156 | "outputs": [], 157 | "source": [ 158 | "from datasets import load_dataset\n", 159 | "\n", 160 | "\n", 161 | "dataset = load_dataset(\"csv\", data_files = {\"train\":\"fineweb-CC-MAIN-2024-10-8B-en.csv\"})" 162 | ] 163 | 
}, 164 | { 165 | "cell_type": "code", 166 | "execution_count": null, 167 | "metadata": {}, 168 | "outputs": [], 169 | "source": [ 170 | "dataset.push_to_hub(\"fineweb-CC-MAIN-2024-10-8B-en\")" 171 | ] 172 | }, 173 | { 174 | "cell_type": "code", 175 | "execution_count": null, 176 | "metadata": {}, 177 | "outputs": [], 178 | "source": [ 179 | "N = 8300000 # Specify the number of rows you want\n", 180 | "dataset_6b = dataset[\"train\"].select(range(N))" 181 | ] 182 | }, 183 | { 184 | "cell_type": "code", 185 | "execution_count": null, 186 | "metadata": {}, 187 | "outputs": [], 188 | "source": [ 189 | "sum(dataset_6b['token_count'])" 190 | ] 191 | }, 192 | { 193 | "cell_type": "code", 194 | "execution_count": null, 195 | "metadata": {}, 196 | "outputs": [], 197 | "source": [ 198 | "dataset_6b.push_to_hub(\"fineweb-CC-MAIN-2024-10-6B-en\")" 199 | ] 200 | }, 201 | { 202 | "cell_type": "code", 203 | "execution_count": null, 204 | "metadata": {}, 205 | "outputs": [], 206 | "source": [ 207 | "# `dataset` was reloaded from the CSV above, so it is a DatasetDict;\n", 208 | "# select the first N rows of its train split.\n", 209 | "\n", 210 | "N = 1500000 # Specify the number of rows you want\n", 211 | "dataset_1b = dataset[\"train\"].select(range(N))" 212 | ] 213 | }, 214 | { 215 | "cell_type": "code", 216 | "execution_count": null, 217 | "metadata": {}, 218 | "outputs": [], 219 | "source": [ 220 | "dataset_1b.push_to_hub(\"fineweb-CC-MAIN-2024-10-1B-en\")" 221 | ] 222 | } 223 | ], 224 | "metadata": { 225 | "colab": { 226 | "provenance": [] 227 | }, 228 | "kernelspec": { 229 | "display_name": "Python 3 (ipykernel)", 230 | "language": "python", 231 | "name": "python3" 232 | }, 233 | "language_info": { 234 | "codemirror_mode": { 235 | "name": "ipython", 236 | "version": 3 237 | }, 238 | "file_extension": ".py", 239 | "mimetype": "text/x-python", 240 | "name": "python", 241 | "nbconvert_exporter": "python", 242 | "pygments_lexer": "ipython3", 243 | "version": "3.10.14" 244 | }, 245 | "widgets": { 246 | 
"application/vnd.jupyter.widget-state+json": { 247 | "048a01b6f7034148b5fa53d0e33088b0": { 248 | "model_module": "@jupyter-widgets/controls", 249 | "model_module_version": "1.5.0", 250 | "model_name": "DescriptionStyleModel", 251 | "state": { 252 | "_model_module": "@jupyter-widgets/controls", 253 | "_model_module_version": "1.5.0", 254 | "_model_name": "DescriptionStyleModel", 255 | "_view_count": null, 256 | "_view_module": "@jupyter-widgets/base", 257 | "_view_module_version": "1.2.0", 258 | "_view_name": "StyleView", 259 | "description_width": "" 260 | } 261 | }, 262 | "05eafd28789747c5b8541e81d392aef4": { 263 | "model_module": "@jupyter-widgets/controls", 264 | "model_module_version": "1.5.0", 265 | "model_name": "ProgressStyleModel", 266 | "state": { 267 | "_model_module": "@jupyter-widgets/controls", 268 | "_model_module_version": "1.5.0", 269 | "_model_name": "ProgressStyleModel", 270 | "_view_count": null, 271 | "_view_module": "@jupyter-widgets/base", 272 | "_view_module_version": "1.2.0", 273 | "_view_name": "StyleView", 274 | "bar_color": null, 275 | "description_width": "" 276 | } 277 | }, 278 | "0b69c40ac70542d7809ae6240c6ea8d3": { 279 | "model_module": "@jupyter-widgets/controls", 280 | "model_module_version": "1.5.0", 281 | "model_name": "FloatProgressModel", 282 | "state": { 283 | "_dom_classes": [], 284 | "_model_module": "@jupyter-widgets/controls", 285 | "_model_module_version": "1.5.0", 286 | "_model_name": "FloatProgressModel", 287 | "_view_count": null, 288 | "_view_module": "@jupyter-widgets/controls", 289 | "_view_module_version": "1.5.0", 290 | "_view_name": "ProgressView", 291 | "bar_style": "success", 292 | "description": "", 293 | "description_tooltip": null, 294 | "layout": "IPY_MODEL_e4457ef8ca424b74a546e2fa85cdf69f", 295 | "max": 23032, 296 | "min": 0, 297 | "orientation": "horizontal", 298 | "style": "IPY_MODEL_53c8d989d57443ce8cd5861721bbdb41", 299 | "value": 23032 300 | } 301 | }, 302 | "0eb083ee46f743d4b429bf6612da55a6": { 303 | 
"model_module": "@jupyter-widgets/base", 304 | "model_module_version": "1.2.0", 305 | "model_name": "LayoutModel", 306 | "state": { 307 | "_model_module": "@jupyter-widgets/base", 308 | "_model_module_version": "1.2.0", 309 | "_model_name": "LayoutModel", 310 | "_view_count": null, 311 | "_view_module": "@jupyter-widgets/base", 312 | "_view_module_version": "1.2.0", 313 | "_view_name": "LayoutView", 314 | "align_content": null, 315 | "align_items": null, 316 | "align_self": null, 317 | "border": null, 318 | "bottom": null, 319 | "display": null, 320 | "flex": null, 321 | "flex_flow": null, 322 | "grid_area": null, 323 | "grid_auto_columns": null, 324 | "grid_auto_flow": null, 325 | "grid_auto_rows": null, 326 | "grid_column": null, 327 | "grid_gap": null, 328 | "grid_row": null, 329 | "grid_template_areas": null, 330 | "grid_template_columns": null, 331 | "grid_template_rows": null, 332 | "height": null, 333 | "justify_content": null, 334 | "justify_items": null, 335 | "left": null, 336 | "margin": null, 337 | "max_height": null, 338 | "max_width": null, 339 | "min_height": null, 340 | "min_width": null, 341 | "object_fit": null, 342 | "object_position": null, 343 | "order": null, 344 | "overflow": null, 345 | "overflow_x": null, 346 | "overflow_y": null, 347 | "padding": null, 348 | "right": null, 349 | "top": null, 350 | "visibility": null, 351 | "width": null 352 | } 353 | }, 354 | "22b34f9b327746dd93ebe6538482ffdc": { 355 | "model_module": "@jupyter-widgets/controls", 356 | "model_module_version": "1.5.0", 357 | "model_name": "HTMLModel", 358 | "state": { 359 | "_dom_classes": [], 360 | "_model_module": "@jupyter-widgets/controls", 361 | "_model_module_version": "1.5.0", 362 | "_model_name": "HTMLModel", 363 | "_view_count": null, 364 | "_view_module": "@jupyter-widgets/controls", 365 | "_view_module_version": "1.5.0", 366 | "_view_name": "HTMLView", 367 | "description": "", 368 | "description_tooltip": null, 369 | "layout": 
"IPY_MODEL_31ae5b1ddf8c405d9b97c901043365d7", 370 | "placeholder": "​", 371 | "style": "IPY_MODEL_2e15dd36207146789d737ad4077d368b", 372 | "value": "Resolving data files: 100%" 373 | } 374 | }, 375 | "2921d2c9d1a7426bab54397a71772757": { 376 | "model_module": "@jupyter-widgets/controls", 377 | "model_module_version": "1.5.0", 378 | "model_name": "HTMLModel", 379 | "state": { 380 | "_dom_classes": [], 381 | "_model_module": "@jupyter-widgets/controls", 382 | "_model_module_version": "1.5.0", 383 | "_model_name": "HTMLModel", 384 | "_view_count": null, 385 | "_view_module": "@jupyter-widgets/controls", 386 | "_view_module_version": "1.5.0", 387 | "_view_name": "HTMLView", 388 | "description": "", 389 | "description_tooltip": null, 390 | "layout": "IPY_MODEL_0eb083ee46f743d4b429bf6612da55a6", 391 | "placeholder": "​", 392 | "style": "IPY_MODEL_048a01b6f7034148b5fa53d0e33088b0", 393 | "value": " 23032/23032 [00:00<00:00,  7.06it/s]" 394 | } 395 | }, 396 | "2e15dd36207146789d737ad4077d368b": { 397 | "model_module": "@jupyter-widgets/controls", 398 | "model_module_version": "1.5.0", 399 | "model_name": "DescriptionStyleModel", 400 | "state": { 401 | "_model_module": "@jupyter-widgets/controls", 402 | "_model_module_version": "1.5.0", 403 | "_model_name": "DescriptionStyleModel", 404 | "_view_count": null, 405 | "_view_module": "@jupyter-widgets/base", 406 | "_view_module_version": "1.2.0", 407 | "_view_name": "StyleView", 408 | "description_width": "" 409 | } 410 | }, 411 | "31ae5b1ddf8c405d9b97c901043365d7": { 412 | "model_module": "@jupyter-widgets/base", 413 | "model_module_version": "1.2.0", 414 | "model_name": "LayoutModel", 415 | "state": { 416 | "_model_module": "@jupyter-widgets/base", 417 | "_model_module_version": "1.2.0", 418 | "_model_name": "LayoutModel", 419 | "_view_count": null, 420 | "_view_module": "@jupyter-widgets/base", 421 | "_view_module_version": "1.2.0", 422 | "_view_name": "LayoutView", 423 | "align_content": null, 424 | "align_items": null, 425 
| "align_self": null, 426 | "border": null, 427 | "bottom": null, 428 | "display": null, 429 | "flex": null, 430 | "flex_flow": null, 431 | "grid_area": null, 432 | "grid_auto_columns": null, 433 | "grid_auto_flow": null, 434 | "grid_auto_rows": null, 435 | "grid_column": null, 436 | "grid_gap": null, 437 | "grid_row": null, 438 | "grid_template_areas": null, 439 | "grid_template_columns": null, 440 | "grid_template_rows": null, 441 | "height": null, 442 | "justify_content": null, 443 | "justify_items": null, 444 | "left": null, 445 | "margin": null, 446 | "max_height": null, 447 | "max_width": null, 448 | "min_height": null, 449 | "min_width": null, 450 | "object_fit": null, 451 | "object_position": null, 452 | "order": null, 453 | "overflow": null, 454 | "overflow_x": null, 455 | "overflow_y": null, 456 | "padding": null, 457 | "right": null, 458 | "top": null, 459 | "visibility": null, 460 | "width": null 461 | } 462 | }, 463 | "3b6f554968d54e70a8e9ece4763fa612": { 464 | "model_module": "@jupyter-widgets/controls", 465 | "model_module_version": "1.5.0", 466 | "model_name": "HBoxModel", 467 | "state": { 468 | "_dom_classes": [], 469 | "_model_module": "@jupyter-widgets/controls", 470 | "_model_module_version": "1.5.0", 471 | "_model_name": "HBoxModel", 472 | "_view_count": null, 473 | "_view_module": "@jupyter-widgets/controls", 474 | "_view_module_version": "1.5.0", 475 | "_view_name": "HBoxView", 476 | "box_style": "", 477 | "children": [ 478 | "IPY_MODEL_22b34f9b327746dd93ebe6538482ffdc", 479 | "IPY_MODEL_0b69c40ac70542d7809ae6240c6ea8d3", 480 | "IPY_MODEL_2921d2c9d1a7426bab54397a71772757" 481 | ], 482 | "layout": "IPY_MODEL_a1e41ea2a9dd4138bcba2643d472ec78" 483 | } 484 | }, 485 | "3d78d66f951c48969c2036764cd415d0": { 486 | "model_module": "@jupyter-widgets/controls", 487 | "model_module_version": "1.5.0", 488 | "model_name": "HTMLModel", 489 | "state": { 490 | "_dom_classes": [], 491 | "_model_module": "@jupyter-widgets/controls", 492 | 
"_model_module_version": "1.5.0", 493 | "_model_name": "HTMLModel", 494 | "_view_count": null, 495 | "_view_module": "@jupyter-widgets/controls", 496 | "_view_module_version": "1.5.0", 497 | "_view_name": "HTMLView", 498 | "description": "", 499 | "description_tooltip": null, 500 | "layout": "IPY_MODEL_bc96b42534554357872d4497616054c4", 501 | "placeholder": "​", 502 | "style": "IPY_MODEL_6a59d212d5184dc39ce77c279ebca348", 503 | "value": " 250/250 [00:00<00:00, 6483.86it/s]" 504 | } 505 | }, 506 | "4289ac665bd3494b8091af46aa03d27b": { 507 | "model_module": "@jupyter-widgets/base", 508 | "model_module_version": "1.2.0", 509 | "model_name": "LayoutModel", 510 | "state": { 511 | "_model_module": "@jupyter-widgets/base", 512 | "_model_module_version": "1.2.0", 513 | "_model_name": "LayoutModel", 514 | "_view_count": null, 515 | "_view_module": "@jupyter-widgets/base", 516 | "_view_module_version": "1.2.0", 517 | "_view_name": "LayoutView", 518 | "align_content": null, 519 | "align_items": null, 520 | "align_self": null, 521 | "border": null, 522 | "bottom": null, 523 | "display": null, 524 | "flex": null, 525 | "flex_flow": null, 526 | "grid_area": null, 527 | "grid_auto_columns": null, 528 | "grid_auto_flow": null, 529 | "grid_auto_rows": null, 530 | "grid_column": null, 531 | "grid_gap": null, 532 | "grid_row": null, 533 | "grid_template_areas": null, 534 | "grid_template_columns": null, 535 | "grid_template_rows": null, 536 | "height": null, 537 | "justify_content": null, 538 | "justify_items": null, 539 | "left": null, 540 | "margin": null, 541 | "max_height": null, 542 | "max_width": null, 543 | "min_height": null, 544 | "min_width": null, 545 | "object_fit": null, 546 | "object_position": null, 547 | "order": null, 548 | "overflow": null, 549 | "overflow_x": null, 550 | "overflow_y": null, 551 | "padding": null, 552 | "right": null, 553 | "top": null, 554 | "visibility": null, 555 | "width": null 556 | } 557 | }, 558 | "53c8d989d57443ce8cd5861721bbdb41": { 559 | 
"model_module": "@jupyter-widgets/controls", 560 | "model_module_version": "1.5.0", 561 | "model_name": "ProgressStyleModel", 562 | "state": { 563 | "_model_module": "@jupyter-widgets/controls", 564 | "_model_module_version": "1.5.0", 565 | "_model_name": "ProgressStyleModel", 566 | "_view_count": null, 567 | "_view_module": "@jupyter-widgets/base", 568 | "_view_module_version": "1.2.0", 569 | "_view_name": "StyleView", 570 | "bar_color": null, 571 | "description_width": "" 572 | } 573 | }, 574 | "6a59d212d5184dc39ce77c279ebca348": { 575 | "model_module": "@jupyter-widgets/controls", 576 | "model_module_version": "1.5.0", 577 | "model_name": "DescriptionStyleModel", 578 | "state": { 579 | "_model_module": "@jupyter-widgets/controls", 580 | "_model_module_version": "1.5.0", 581 | "_model_name": "DescriptionStyleModel", 582 | "_view_count": null, 583 | "_view_module": "@jupyter-widgets/base", 584 | "_view_module_version": "1.2.0", 585 | "_view_name": "StyleView", 586 | "description_width": "" 587 | } 588 | }, 589 | "7c5d9781693245f8bf888e8f909f097f": { 590 | "model_module": "@jupyter-widgets/controls", 591 | "model_module_version": "1.5.0", 592 | "model_name": "DescriptionStyleModel", 593 | "state": { 594 | "_model_module": "@jupyter-widgets/controls", 595 | "_model_module_version": "1.5.0", 596 | "_model_name": "DescriptionStyleModel", 597 | "_view_count": null, 598 | "_view_module": "@jupyter-widgets/base", 599 | "_view_module_version": "1.2.0", 600 | "_view_name": "StyleView", 601 | "description_width": "" 602 | } 603 | }, 604 | "952a134784564a40a26f60fdc24a4b29": { 605 | "model_module": "@jupyter-widgets/controls", 606 | "model_module_version": "1.5.0", 607 | "model_name": "HTMLModel", 608 | "state": { 609 | "_dom_classes": [], 610 | "_model_module": "@jupyter-widgets/controls", 611 | "_model_module_version": "1.5.0", 612 | "_model_name": "HTMLModel", 613 | "_view_count": null, 614 | "_view_module": "@jupyter-widgets/controls", 615 | "_view_module_version": 
"1.5.0", 616 | "_view_name": "HTMLView", 617 | "description": "", 618 | "description_tooltip": null, 619 | "layout": "IPY_MODEL_fd32ad532d594347849bf1aa89fa0b15", 620 | "placeholder": "​", 621 | "style": "IPY_MODEL_7c5d9781693245f8bf888e8f909f097f", 622 | "value": "Resolving data files: 100%" 623 | } 624 | }, 625 | "a1e41ea2a9dd4138bcba2643d472ec78": { 626 | "model_module": "@jupyter-widgets/base", 627 | "model_module_version": "1.2.0", 628 | "model_name": "LayoutModel", 629 | "state": { 630 | "_model_module": "@jupyter-widgets/base", 631 | "_model_module_version": "1.2.0", 632 | "_model_name": "LayoutModel", 633 | "_view_count": null, 634 | "_view_module": "@jupyter-widgets/base", 635 | "_view_module_version": "1.2.0", 636 | "_view_name": "LayoutView", 637 | "align_content": null, 638 | "align_items": null, 639 | "align_self": null, 640 | "border": null, 641 | "bottom": null, 642 | "display": null, 643 | "flex": null, 644 | "flex_flow": null, 645 | "grid_area": null, 646 | "grid_auto_columns": null, 647 | "grid_auto_flow": null, 648 | "grid_auto_rows": null, 649 | "grid_column": null, 650 | "grid_gap": null, 651 | "grid_row": null, 652 | "grid_template_areas": null, 653 | "grid_template_columns": null, 654 | "grid_template_rows": null, 655 | "height": null, 656 | "justify_content": null, 657 | "justify_items": null, 658 | "left": null, 659 | "margin": null, 660 | "max_height": null, 661 | "max_width": null, 662 | "min_height": null, 663 | "min_width": null, 664 | "object_fit": null, 665 | "object_position": null, 666 | "order": null, 667 | "overflow": null, 668 | "overflow_x": null, 669 | "overflow_y": null, 670 | "padding": null, 671 | "right": null, 672 | "top": null, 673 | "visibility": null, 674 | "width": null 675 | } 676 | }, 677 | "b9da0257d25e47d087dda0644be79d0f": { 678 | "model_module": "@jupyter-widgets/controls", 679 | "model_module_version": "1.5.0", 680 | "model_name": "HBoxModel", 681 | "state": { 682 | "_dom_classes": [], 683 | "_model_module": 
"@jupyter-widgets/controls", 684 | "_model_module_version": "1.5.0", 685 | "_model_name": "HBoxModel", 686 | "_view_count": null, 687 | "_view_module": "@jupyter-widgets/controls", 688 | "_view_module_version": "1.5.0", 689 | "_view_name": "HBoxView", 690 | "box_style": "", 691 | "children": [ 692 | "IPY_MODEL_952a134784564a40a26f60fdc24a4b29", 693 | "IPY_MODEL_c86d3e5c71804299bd3626eda5ad6bec", 694 | "IPY_MODEL_3d78d66f951c48969c2036764cd415d0" 695 | ], 696 | "layout": "IPY_MODEL_ed5c1bfee6084b40953064d2cf66ebdd" 697 | } 698 | }, 699 | "bc96b42534554357872d4497616054c4": { 700 | "model_module": "@jupyter-widgets/base", 701 | "model_module_version": "1.2.0", 702 | "model_name": "LayoutModel", 703 | "state": { 704 | "_model_module": "@jupyter-widgets/base", 705 | "_model_module_version": "1.2.0", 706 | "_model_name": "LayoutModel", 707 | "_view_count": null, 708 | "_view_module": "@jupyter-widgets/base", 709 | "_view_module_version": "1.2.0", 710 | "_view_name": "LayoutView", 711 | "align_content": null, 712 | "align_items": null, 713 | "align_self": null, 714 | "border": null, 715 | "bottom": null, 716 | "display": null, 717 | "flex": null, 718 | "flex_flow": null, 719 | "grid_area": null, 720 | "grid_auto_columns": null, 721 | "grid_auto_flow": null, 722 | "grid_auto_rows": null, 723 | "grid_column": null, 724 | "grid_gap": null, 725 | "grid_row": null, 726 | "grid_template_areas": null, 727 | "grid_template_columns": null, 728 | "grid_template_rows": null, 729 | "height": null, 730 | "justify_content": null, 731 | "justify_items": null, 732 | "left": null, 733 | "margin": null, 734 | "max_height": null, 735 | "max_width": null, 736 | "min_height": null, 737 | "min_width": null, 738 | "object_fit": null, 739 | "object_position": null, 740 | "order": null, 741 | "overflow": null, 742 | "overflow_x": null, 743 | "overflow_y": null, 744 | "padding": null, 745 | "right": null, 746 | "top": null, 747 | "visibility": null, 748 | "width": null 749 | } 750 | }, 751 | 
"c86d3e5c71804299bd3626eda5ad6bec": { 752 | "model_module": "@jupyter-widgets/controls", 753 | "model_module_version": "1.5.0", 754 | "model_name": "FloatProgressModel", 755 | "state": { 756 | "_dom_classes": [], 757 | "_model_module": "@jupyter-widgets/controls", 758 | "_model_module_version": "1.5.0", 759 | "_model_name": "FloatProgressModel", 760 | "_view_count": null, 761 | "_view_module": "@jupyter-widgets/controls", 762 | "_view_module_version": "1.5.0", 763 | "_view_name": "ProgressView", 764 | "bar_style": "success", 765 | "description": "", 766 | "description_tooltip": null, 767 | "layout": "IPY_MODEL_4289ac665bd3494b8091af46aa03d27b", 768 | "max": 250, 769 | "min": 0, 770 | "orientation": "horizontal", 771 | "style": "IPY_MODEL_05eafd28789747c5b8541e81d392aef4", 772 | "value": 250 773 | } 774 | }, 775 | "e4457ef8ca424b74a546e2fa85cdf69f": { 776 | "model_module": "@jupyter-widgets/base", 777 | "model_module_version": "1.2.0", 778 | "model_name": "LayoutModel", 779 | "state": { 780 | "_model_module": "@jupyter-widgets/base", 781 | "_model_module_version": "1.2.0", 782 | "_model_name": "LayoutModel", 783 | "_view_count": null, 784 | "_view_module": "@jupyter-widgets/base", 785 | "_view_module_version": "1.2.0", 786 | "_view_name": "LayoutView", 787 | "align_content": null, 788 | "align_items": null, 789 | "align_self": null, 790 | "border": null, 791 | "bottom": null, 792 | "display": null, 793 | "flex": null, 794 | "flex_flow": null, 795 | "grid_area": null, 796 | "grid_auto_columns": null, 797 | "grid_auto_flow": null, 798 | "grid_auto_rows": null, 799 | "grid_column": null, 800 | "grid_gap": null, 801 | "grid_row": null, 802 | "grid_template_areas": null, 803 | "grid_template_columns": null, 804 | "grid_template_rows": null, 805 | "height": null, 806 | "justify_content": null, 807 | "justify_items": null, 808 | "left": null, 809 | "margin": null, 810 | "max_height": null, 811 | "max_width": null, 812 | "min_height": null, 813 | "min_width": null, 814 | 
"object_fit": null, 815 | "object_position": null, 816 | "order": null, 817 | "overflow": null, 818 | "overflow_x": null, 819 | "overflow_y": null, 820 | "padding": null, 821 | "right": null, 822 | "top": null, 823 | "visibility": null, 824 | "width": null 825 | } 826 | }, 827 | "ed5c1bfee6084b40953064d2cf66ebdd": { 828 | "model_module": "@jupyter-widgets/base", 829 | "model_module_version": "1.2.0", 830 | "model_name": "LayoutModel", 831 | "state": { 832 | "_model_module": "@jupyter-widgets/base", 833 | "_model_module_version": "1.2.0", 834 | "_model_name": "LayoutModel", 835 | "_view_count": null, 836 | "_view_module": "@jupyter-widgets/base", 837 | "_view_module_version": "1.2.0", 838 | "_view_name": "LayoutView", 839 | "align_content": null, 840 | "align_items": null, 841 | "align_self": null, 842 | "border": null, 843 | "bottom": null, 844 | "display": null, 845 | "flex": null, 846 | "flex_flow": null, 847 | "grid_area": null, 848 | "grid_auto_columns": null, 849 | "grid_auto_flow": null, 850 | "grid_auto_rows": null, 851 | "grid_column": null, 852 | "grid_gap": null, 853 | "grid_row": null, 854 | "grid_template_areas": null, 855 | "grid_template_columns": null, 856 | "grid_template_rows": null, 857 | "height": null, 858 | "justify_content": null, 859 | "justify_items": null, 860 | "left": null, 861 | "margin": null, 862 | "max_height": null, 863 | "max_width": null, 864 | "min_height": null, 865 | "min_width": null, 866 | "object_fit": null, 867 | "object_position": null, 868 | "order": null, 869 | "overflow": null, 870 | "overflow_x": null, 871 | "overflow_y": null, 872 | "padding": null, 873 | "right": null, 874 | "top": null, 875 | "visibility": null, 876 | "width": null 877 | } 878 | }, 879 | "fd32ad532d594347849bf1aa89fa0b15": { 880 | "model_module": "@jupyter-widgets/base", 881 | "model_module_version": "1.2.0", 882 | "model_name": "LayoutModel", 883 | "state": { 884 | "_model_module": "@jupyter-widgets/base", 885 | "_model_module_version": "1.2.0", 886 
| "_model_name": "LayoutModel", 887 | "_view_count": null, 888 | "_view_module": "@jupyter-widgets/base", 889 | "_view_module_version": "1.2.0", 890 | "_view_name": "LayoutView", 891 | "align_content": null, 892 | "align_items": null, 893 | "align_self": null, 894 | "border": null, 895 | "bottom": null, 896 | "display": null, 897 | "flex": null, 898 | "flex_flow": null, 899 | "grid_area": null, 900 | "grid_auto_columns": null, 901 | "grid_auto_flow": null, 902 | "grid_auto_rows": null, 903 | "grid_column": null, 904 | "grid_gap": null, 905 | "grid_row": null, 906 | "grid_template_areas": null, 907 | "grid_template_columns": null, 908 | "grid_template_rows": null, 909 | "height": null, 910 | "justify_content": null, 911 | "justify_items": null, 912 | "left": null, 913 | "margin": null, 914 | "max_height": null, 915 | "max_width": null, 916 | "min_height": null, 917 | "min_width": null, 918 | "object_fit": null, 919 | "object_position": null, 920 | "order": null, 921 | "overflow": null, 922 | "overflow_x": null, 923 | "overflow_y": null, 924 | "padding": null, 925 | "right": null, 926 | "top": null, 927 | "visibility": null, 928 | "width": null 929 | } 930 | } 931 | } 932 | } 933 | }, 934 | "nbformat": 4, 935 | "nbformat_minor": 4 936 | } 937 | -------------------------------------------------------------------------------- /Llama-3/Part 2/assets/Comparision_of_Model_Scores.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Blaizzy/Coding-LLMs-from-scratch/9c3a182fcfbe898f532c9de947a5fc692ca4cc6b/Llama-3/Part 2/assets/Comparision_of_Model_Scores.png -------------------------------------------------------------------------------- /Llama-3/Part 2/assets/Experiment Canvas.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Blaizzy/Coding-LLMs-from-scratch/9c3a182fcfbe898f532c9de947a5fc692ca4cc6b/Llama-3/Part 2/assets/Experiment 
Canvas.png -------------------------------------------------------------------------------- /Llama-3/Part 2/assets/Llama-3-8B-vs-6B-v0.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Blaizzy/Coding-LLMs-from-scratch/9c3a182fcfbe898f532c9de947a5fc692ca4cc6b/Llama-3/Part 2/assets/Llama-3-8B-vs-6B-v0.png -------------------------------------------------------------------------------- /Llama-3/Part 2/assets/Training Loss.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Blaizzy/Coding-LLMs-from-scratch/9c3a182fcfbe898f532c9de947a5fc692ca4cc6b/Llama-3/Part 2/assets/Training Loss.png -------------------------------------------------------------------------------- /Llama-3/Part 2/assets/downcycling.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Blaizzy/Coding-LLMs-from-scratch/9c3a182fcfbe898f532c9de947a5fc692ca4cc6b/Llama-3/Part 2/assets/downcycling.png -------------------------------------------------------------------------------- /Llama-3/Part 2/assets/llama-3-6B icon.jpeg: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Blaizzy/Coding-LLMs-from-scratch/9c3a182fcfbe898f532c9de947a5fc692ca4cc6b/Llama-3/Part 2/assets/llama-3-6B icon.jpeg -------------------------------------------------------------------------------- /Llama-3/Part 2/assets/model_scores.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/Blaizzy/Coding-LLMs-from-scratch/9c3a182fcfbe898f532c9de947a5fc692ca4cc6b/Llama-3/Part 2/assets/model_scores.png -------------------------------------------------------------------------------- /Llama-3/Part 2/assets/model_scores_llama_3_8B.png: -------------------------------------------------------------------------------- 
https://raw.githubusercontent.com/Blaizzy/Coding-LLMs-from-scratch/9c3a182fcfbe898f532c9de947a5fc692ca4cc6b/Llama-3/Part 2/assets/model_scores_llama_3_8B.png -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Coding LLMs from scratch 2 | # Coding Llama-2 3 | You will learn how to train and fine-tune a Llama 2 model from scratch. 4 | 5 | Throughout the series you will learn about the transformer architecture, different attention mechanisms (MHA, MQA and GQA), KV cache, RoPE, and the Hugging Face Trainer in detail. 6 | 7 | By the end, you will have created and trained a LLaMA 2 model with 100M parameters from scratch using PyTorch to do code completion. 8 | 9 | 🎥 **YT Video Playlist:** 10 | - https://youtube.com/playlist?list=PLDn_JsyofyfTH5_5V1MNb8UYKxMl6IMNy&si=5Y4cm-6wrMOD1Abr 11 | 12 | 13 | 14 | # Coding Llama-3 15 | 16 | You will learn how to train and fine-tune a Llama 3 model from scratch. 17 | 18 | The goal is to code LLaMA 3 from scratch in PyTorch to create models with sizes 3B, 6B, 35B and 45B params. 19 | 20 | 🎥 **YT Video Playlist:** 21 | - https://youtube.com/playlist?list=PLDn_JsyofyfTH5_5V1MNb8UYKxMl6IMNy&si=5Y4cm-6wrMOD1Abr 22 | 23 | 📚 **Papers**: 24 | - Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints 25 | : https://arxiv.org/abs/2212.05055 26 | - Pre-training Small Base LMs with Fewer Tokens: https://arxiv.org/abs/2404.08634 27 | - Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention: https://arxiv.org/abs/2404.07143 28 | 29 | 30 | 31 | ## Llama-3-6B-v0.1 32 | Llama-3-6B 33 | 34 | Introducing the world's first Llama-3 base model with 6B parameters.
This model is a pretrained version of [prince-canuma/Llama-3-6B-v0](https://huggingface.co/prince-canuma/Llama-3-6B-v0), which was created from Meta-Llama-3-8B using a technique called [downcycling](https://youtube.com/playlist?list=PLDn_JsyofyfTH5_5V1MNb8UYKxMl6IMNy&si=9hcOol4KHIgWThgt). 35 | The model was continually pretrained on 1 billion tokens of English-only text from FineWeb, reaching the following result on the evaluation set: 36 | - Loss: 2.4942 37 | 38 | 39 | ## Model Description 40 | 41 | - **Developed by:** [Prince Canuma](https://huggingface.co/prince-canuma) 42 | - **Sponsored by:** General 43 | - **Model type:** Llama 44 | - **License:** [Llama-3](https://llama.meta.com/llama3/license) 45 | - **Pretrained from model:** prince-canuma/Llama-3-6B-v0 46 | 47 | ### Model Sources 48 | 49 | - **Repository:** https://github.com/Blaizzy/Coding-LLMs-from-scratch/tree/main/Llama-3 50 | - **Video:** https://youtube.com/playlist?list=PLDn_JsyofyfTH5_5V1MNb8UYKxMl6IMNy&si=5Y4cm-6wrMOD1Abr 51 | 52 | ## Uses 53 | 54 | 55 | You can use this model to create instruct and chat versions for various use cases such as coding assistants, RAG, function calling and more. 56 | 57 | ### Limitations 58 | 59 | This model inherits some of the base model's limitations and some additional ones from its creation process, such as: 60 | - Limited scope for coding and math: According to benchmarks, this model needs more pretraining/fine-tuning on code and math data to excel at reasoning tasks. 61 | - Language limitations: This model was continually pretrained on English-only data. If you are planning to use it for multilingual use cases, I recommend fine-tuning or continued pretraining. 62 | 63 | 64 | ## Read more 65 | https://huggingface.co/prince-canuma/Llama-3-6B-v0.1 --------------------------------------------------------------------------------