├── .gitignore
├── Llama-2
│   ├── Part 1
│   │   └── BabyLLaMA.ipynb
│   ├── Part 2
│   │   └── BabyLLaMA.ipynb
│   └── Part 3
│       └── BabyLLaMA.ipynb
├── Llama-3
│   ├── Part 1
│   │   └── Downcycling.ipynb
│   └── Part 2
│       ├── Config
│       │   └── train-llama-3-6B.yml
│       ├── Downcycling_Comparision.ipynb
│       ├── FineWeb10B.ipynb
│       └── assets
│           ├── Comparision_of_Model_Scores.png
│           ├── Experiment Canvas.png
│           ├── Llama-3-8B-vs-6B-v0.png
│           ├── Training Loss.png
│           ├── downcycling.png
│           ├── llama-3-6B icon.jpeg
│           ├── model_scores.png
│           └── model_scores_llama_3_8B.png
└── README.md
/.gitignore:
--------------------------------------------------------------------------------
1 | .DS_Store
2 |
--------------------------------------------------------------------------------
/Llama-2/Part 1/BabyLLaMA.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "\n",
8 | "
\n",
9 | ""
10 | ]
11 | },
12 | {
13 | "cell_type": "markdown",
14 | "metadata": {
15 | "id": "n4oOcpvm2Bt6"
16 | },
17 | "source": [
18 | " # BabyLLaMA\n",
19 | "\n",
20 | "Coding the LLaMA-2 research paper from scratch to create models with sizes 100M, 250M and 500M params."
21 | ]
22 | },
23 | {
24 | "cell_type": "markdown",
25 | "metadata": {
26 | "id": "NrmuruVN5_2F"
27 | },
28 | "source": [
29 | "## Model Arch\n",
30 | "\n",
31 | "Decoder only: Composed of identical `n_layers`. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple position-wise fully connected FFN. We employ residual connection around each of the sub-layers, followed by layers normalizatin. That is:\n",
32 | "LayerNorm(x + Sublayer(x))\n",
33 | " -- A Vaswani et al., 2017."
34 | ]
35 | },
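  | {
  | "cell_type": "code",
  | "execution_count": null,
  | "metadata": {},
  | "outputs": [],
  | "source": [
  | "# A minimal, self-contained sketch of the quoted rule LayerNorm(x + Sublayer(x)).\n",
  | "# `sublayer` here is just an nn.Linear standing in for attention or the FFN;\n",
  | "# names and shapes are illustrative, not part of the model built below.\n",
  | "import torch\n",
  | "from torch import nn\n",
  | "\n",
  | "d = 8\n",
  | "norm = nn.LayerNorm(d)\n",
  | "sublayer = nn.Linear(d, d)  # stand-in for self-attention or the FFN\n",
  | "x = torch.randn(2, 4, d)  # [batch, seq_len, d]\n",
  | "norm(x + sublayer(x)).shape  # residual, then layer norm -> torch.Size([2, 4, 8])\n"
  | ]
  | },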
36 | {
37 | "cell_type": "code",
38 | "execution_count": null,
39 | "metadata": {
40 | "id": "bZjKLwgz6os2"
41 | },
42 | "outputs": [],
43 | "source": [
44 | "import torch\n",
45 | "from torch import nn\n",
46 | "import torch.nn.functional as F\n",
47 | "from math import sqrt"
48 | ]
49 | },
50 | {
51 | "cell_type": "code",
52 | "execution_count": null,
53 | "metadata": {
54 | "id": "xf2Oy7eGLr8f"
55 | },
56 | "outputs": [],
57 | "source": [
58 | "n_layers = 6 # 22 Tiny LLaMA\n",
59 | "n_heads = 6 # 32 Tiny LLaMA\n",
60 | "d_model = 768 # 2048 Tiny LLaMA\n",
61 | "intermediate_dim = d_model * 4"
62 | ]
63 | },
64 | {
65 | "cell_type": "markdown",
66 | "metadata": {
67 | "id": "tmh9Ms0WKW6B"
68 | },
69 | "source": [
70 | "### MHA\n",
71 | "
"
72 | ]
73 | },
74 | {
75 | "cell_type": "code",
76 | "execution_count": null,
77 | "metadata": {
78 | "id": "yR6sUtOaUZ-r"
79 | },
80 | "outputs": [],
81 | "source": [
82 | "# Generate random input data\n",
83 | "sequence_length = 10 # number of tokens\n",
84 | "batch_size = 5\n",
85 | "input_data = torch.rand((batch_size, sequence_length, d_model)) # [bs, sequence_length, d_model]"
86 | ]
87 | },
88 | {
89 | "cell_type": "code",
90 | "execution_count": null,
91 | "metadata": {
92 | "colab": {
93 | "base_uri": "https://localhost:8080/"
94 | },
95 | "id": "Q0P9tmSWVMrz",
96 | "outputId": "debef009-d3d9-41dd-b7bc-dd1a644b15f0"
97 | },
98 | "outputs": [
99 | {
100 | "data": {
101 | "text/plain": [
102 | "torch.Size([5, 10, 768])"
103 | ]
104 | },
105 | "execution_count": 99,
106 | "metadata": {},
107 | "output_type": "execute_result"
108 | }
109 | ],
110 | "source": [
111 | "input_data.shape"
112 | ]
113 | },
114 | {
115 | "cell_type": "markdown",
116 | "metadata": {
117 | "id": "X6fJhFVGZQDG"
118 | },
119 | "source": [
120 | "- MQA\n",
121 | "- GQA"
122 | ]
123 | },
124 | {
125 | "cell_type": "code",
126 | "execution_count": null,
127 | "metadata": {
128 | "id": "L6BA3bdJJTVu"
129 | },
130 | "outputs": [],
131 | "source": [
132 | "class AttentionHead(nn.Module):\n",
133 | " def __init__(self, embed_dim, hidden_dim):\n",
134 | " super(AttentionHead, self).__init__()\n",
135 | " self.q = nn.Linear(embed_dim, hidden_dim)\n",
136 | " self.k = nn.Linear(embed_dim, hidden_dim)\n",
137 | " self.v = nn.Linear(embed_dim, hidden_dim)\n",
138 | "\n",
139 | " def scaled_dot_product_attention(self, q, k, v, mask = None):\n",
140 | " dim_k = q.size(-1)\n",
141 | " scores = torch.bmm(q, k.transpose(1, 2)) / sqrt(dim_k) # k.T = [bs, seq_len, embed_dim] -> [bs, embed_dim, seq_len]\n",
142 | " if mask is not None:\n",
143 | " scores = torch.masked_fill(scores, mask == 0, -torch.inf)\n",
144 | " weights = F.softmax(scores, dim=-1)\n",
145 | " return torch.bmm(weights, v)\n",
146 | "\n",
147 | " def forward(self, hidden_state, mask=None):\n",
148 | " output = self.scaled_dot_product_attention(\n",
149 | " self.q(hidden_state), self.k(hidden_state), self.v(hidden_state), mask=mask\n",
150 | " )\n",
151 | " return output"
152 | ]
153 | },
154 | {
155 | "cell_type": "code",
156 | "execution_count": null,
157 | "metadata": {
158 | "colab": {
159 | "base_uri": "https://localhost:8080/"
160 | },
161 | "id": "yR7l4QQIW_1S",
162 | "outputId": "7ebd7dd2-bff2-4167-d5a5-6a53c85e0d72"
163 | },
164 | "outputs": [
165 | {
166 | "data": {
167 | "text/plain": [
168 | "torch.Size([5, 10, 128])"
169 | ]
170 | },
171 | "execution_count": 8,
172 | "metadata": {},
173 | "output_type": "execute_result"
174 | }
175 | ],
176 | "source": [
177 | "attn = AttentionHead(d_model, d_model//n_heads)\n",
178 | "attn(input_data).shape"
179 | ]
180 | },
181 | {
182 | "cell_type": "code",
183 | "execution_count": null,
184 | "metadata": {
185 | "id": "9D3EVPKaYXRJ"
186 | },
187 | "outputs": [],
188 | "source": [
189 | "class MHA(nn.Module):\n",
190 | " def __init__(self, n_heads, hidden_dim):\n",
191 | " super(MHA, self).__init__()\n",
192 | " embed_dim = hidden_dim\n",
193 | " head_dim = hidden_dim // n_heads\n",
194 | " self.heads = nn.ModuleList(\n",
195 | " [AttentionHead(embed_dim, head_dim) for _ in range(n_heads)]\n",
196 | " )\n",
197 | " self.out_proj = nn.Linear(embed_dim, embed_dim)\n",
198 | "\n",
199 | " def forward(self, hidden_state):\n",
200 | " x = torch.cat([h(hidden_state) for h in self.heads], dim=-1)\n",
201 | " return self.out_proj(x)"
202 | ]
203 | },
204 | {
205 | "cell_type": "code",
206 | "execution_count": null,
207 | "metadata": {
208 | "colab": {
209 | "base_uri": "https://localhost:8080/"
210 | },
211 | "id": "DmeF6UoHe_Vy",
212 | "outputId": "bd59b7a8-4e7f-41f2-82b6-79970a9e36ad"
213 | },
214 | "outputs": [
215 | {
216 | "data": {
217 | "text/plain": [
218 | "torch.Size([5, 10, 768])"
219 | ]
220 | },
221 | "execution_count": 10,
222 | "metadata": {},
223 | "output_type": "execute_result"
224 | }
225 | ],
226 | "source": [
227 | "mha = MHA(n_heads, d_model)\n",
228 | "mha(input_data).shape"
229 | ]
230 | },
231 | {
232 | "cell_type": "code",
233 | "execution_count": null,
234 | "metadata": {
235 | "colab": {
236 | "base_uri": "https://localhost:8080/"
237 | },
238 | "id": "jnqN13SHOMQF",
239 | "outputId": "c43aecaf-fac5-456d-a759-37df602cc6a3"
240 | },
241 | "outputs": [
242 | {
243 | "data": {
244 | "text/plain": [
245 | "torch.Size([5, 10, 768])"
246 | ]
247 | },
248 | "execution_count": 11,
249 | "metadata": {},
250 | "output_type": "execute_result"
251 | }
252 | ],
253 | "source": [
254 | "input_data.shape"
255 | ]
256 | },
257 | {
258 | "cell_type": "code",
259 | "execution_count": null,
260 | "metadata": {
261 | "colab": {
262 | "base_uri": "https://localhost:8080/"
263 | },
264 | "id": "xj0XTxleorjE",
265 | "outputId": "f7e134d7-513b-4e50-f036-df84ef878274"
266 | },
267 | "outputs": [
268 | {
269 | "data": {
270 | "text/plain": [
271 | "MHA(\n",
272 | " (heads): ModuleList(\n",
273 | " (0-5): 6 x AttentionHead(\n",
274 | " (q): Linear(in_features=768, out_features=128, bias=True)\n",
275 | " (k): Linear(in_features=768, out_features=128, bias=True)\n",
276 | " (v): Linear(in_features=768, out_features=128, bias=True)\n",
277 | " )\n",
278 | " )\n",
279 | " (out_proj): Linear(in_features=768, out_features=768, bias=True)\n",
280 | ")"
281 | ]
282 | },
283 | "execution_count": 12,
284 | "metadata": {},
285 | "output_type": "execute_result"
286 | }
287 | ],
288 | "source": [
289 | "mha"
290 | ]
291 | },
292 | {
293 | "cell_type": "code",
294 | "execution_count": null,
295 | "metadata": {
296 | "id": "rQfC90HwjtgN"
297 | },
298 | "outputs": [],
299 | "source": [
300 | "class LLaMAMLP(nn.Module):\n",
301 | " def __init__(self, hidden_dim, intermediate_dim): # in MLP: intermediate_dim= 4 * hidden_dim\n",
302 | " super(LLaMAMLP, self).__init__()\n",
303 | " self.linear_1 = nn.Linear(hidden_dim, intermediate_dim)\n",
304 | " self.linear_2 = nn.Linear(hidden_dim, intermediate_dim) # Original: intermediate -> hidden.\n",
305 | " self.activation_fn = nn.SiLU()\n",
306 | " self.out_proj = nn.Linear(intermediate_dim, hidden_dim) # Original: dropout\n",
307 | "\n",
308 | "\n",
309 | " def forward(self, hidden_state):\n",
310 | " x_fc_1 = self.linear_1(hidden_state)\n",
311 | " x_fc_2 = self.linear_2(hidden_state)\n",
312 | " x = self.activation_fn(x_fc_1) * x_fc_2\n",
313 | " return self.out_proj(x)"
314 | ]
315 | },
316 | {
317 | "cell_type": "code",
318 | "execution_count": null,
319 | "metadata": {
320 | "colab": {
321 | "base_uri": "https://localhost:8080/"
322 | },
323 | "id": "HifM3AeAq44l",
324 | "outputId": "b7a981f1-a0e7-44a6-bf51-ee44a71d9311"
325 | },
326 | "outputs": [
327 | {
328 | "data": {
329 | "text/plain": [
330 | "torch.Size([5, 10, 768])"
331 | ]
332 | },
333 | "execution_count": 18,
334 | "metadata": {},
335 | "output_type": "execute_result"
336 | }
337 | ],
338 | "source": [
339 | "mlp = LLaMAMLP(d_model, intermediate_dim)\n",
340 | "mlp(input_data).shape"
341 | ]
342 | },
343 | {
344 | "cell_type": "code",
345 | "execution_count": null,
346 | "metadata": {
347 | "id": "DDnXHH6bP4a3"
348 | },
349 | "outputs": [],
350 | "source": [
351 | "class Block(nn.Module):\n",
352 | " def __init__(self, n_heads, hidden_dim, intermediate_dim):\n",
353 | " super(Block, self).__init__()\n",
354 | " self.n_heads = n_heads\n",
355 | " self.hidden_dim = hidden_dim\n",
356 | " self.intermediate_dim = intermediate_dim\n",
357 | " self.mha = MHA(n_heads, hidden_dim=hidden_dim)\n",
358 | " self.layer_norm = nn.LayerNorm(hidden_dim)\n",
359 | " self.mlp = LLaMAMLP(hidden_dim, intermediate_dim)\n",
360 | "\n",
361 | " def forward(self, hidden_state, mask=None):\n",
362 | " x = self.mha(hidden_state)\n",
363 | " x = self.layer_norm(hidden_state) + x\n",
364 | " x_fc = self.mlp(x)\n",
365 | " x += x_fc\n",
366 | " return x\n"
367 | ]
368 | },
369 | {
370 | "cell_type": "code",
371 | "execution_count": null,
372 | "metadata": {
373 | "colab": {
374 | "base_uri": "https://localhost:8080/"
375 | },
376 | "id": "S5UplSTmSDpt",
377 | "outputId": "12f728e4-056d-4bdf-f1eb-f06c5982831c"
378 | },
379 | "outputs": [
380 | {
381 | "data": {
382 | "text/plain": [
383 | "torch.Size([5, 10, 768])"
384 | ]
385 | },
386 | "execution_count": 20,
387 | "metadata": {},
388 | "output_type": "execute_result"
389 | }
390 | ],
391 | "source": [
392 | "block = Block(n_heads, d_model, intermediate_dim)\n",
393 | "block(input_data).shape"
394 | ]
395 | },
396 | {
397 | "cell_type": "code",
398 | "execution_count": null,
399 | "metadata": {
400 | "id": "GcngjpHCSUnh"
401 | },
402 | "outputs": [],
403 | "source": [
404 | "class babyLLaMA(nn.Module):\n",
405 | " def __init__(self, max_seq_len, vocab_size, n_layers, n_heads, hidden_dim, intermediate_dim):\n",
406 | " super(babyLLaMA, self).__init__()\n",
407 | " self.emb = nn.Embedding(vocab_size, hidden_dim)\n",
408 | " self.pos = nn.Embedding(max_seq_len, hidden_dim)\n",
409 | " self.blocks = nn.ModuleList(\n",
410 | " [Block(n_heads, hidden_dim, intermediate_dim) for _ in range(n_layers)]\n",
411 | " )\n",
412 | " self.out_proj = nn.Linear(hidden_dim, vocab_size)\n",
413 | "\n",
414 | " def forward(self, hidden_state):\n",
415 | " emb = self.emb(hidden_state)\n",
416 | " seq_len = hidden_state.size(1)\n",
417 | " positions = torch.arange(seq_len, dtype=torch.long).unsqueeze(0)\n",
418 | " pos = self.pos(positions)\n",
419 | " x = emb + pos\n",
420 | " for b in self.blocks:\n",
421 | " x = b(x)\n",
422 | "\n",
423 | " x = self.out_proj(x)\n",
424 | " return F.softmax(x, dim=-1)\n"
425 | ]
426 | },
427 | {
428 | "cell_type": "code",
429 | "execution_count": null,
430 | "metadata": {
431 | "id": "HZqvBWtmVxUk"
432 | },
433 | "outputs": [],
434 | "source": [
435 | "llm = babyLLaMA(d_model, 32000, 22, n_heads, d_model, intermediate_dim)\n",
436 | "input_ids = torch.randint(1, 32000, (batch_size, sequence_length))"
437 | ]
438 | },
439 | {
440 | "cell_type": "code",
441 | "execution_count": null,
442 | "metadata": {
443 | "colab": {
444 | "base_uri": "https://localhost:8080/"
445 | },
446 | "id": "Lk9xlnKeYNuY",
447 | "outputId": "6b45abed-323a-4ea7-f105-0a9503ba1b85"
448 | },
449 | "outputs": [
450 | {
451 | "data": {
452 | "text/plain": [
453 | "torch.Size([5, 10, 32000])"
454 | ]
455 | },
456 | "execution_count": 43,
457 | "metadata": {},
458 | "output_type": "execute_result"
459 | }
460 | ],
461 | "source": [
462 | "llm(input_ids).shape"
463 | ]
464 | },
465 | {
466 | "cell_type": "code",
467 | "execution_count": null,
468 | "metadata": {
469 | "colab": {
470 | "base_uri": "https://localhost:8080/"
471 | },
472 | "id": "geZxTDtrYica",
473 | "outputId": "23823184-49d3-4140-8bc9-af08580d9f7d"
474 | },
475 | "outputs": [
476 | {
477 | "data": {
478 | "text/plain": [
479 | "babyLLaMA(\n",
480 | " (emb): Embedding(32000, 768)\n",
481 | " (pos): Embedding(768, 768)\n",
482 | " (blocks): ModuleList(\n",
483 | " (0-5): 6 x Block(\n",
484 | " (mha): MHA(\n",
485 | " (heads): ModuleList(\n",
486 | " (0-5): 6 x AttentionHead(\n",
487 | " (q): Linear(in_features=768, out_features=128, bias=True)\n",
488 | " (k): Linear(in_features=768, out_features=128, bias=True)\n",
489 | " (v): Linear(in_features=768, out_features=128, bias=True)\n",
490 | " )\n",
491 | " )\n",
492 | " (out_proj): Linear(in_features=768, out_features=768, bias=True)\n",
493 | " )\n",
494 | " (layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)\n",
495 | " (mlp): LLaMAMLP(\n",
496 | " (linear_1): Linear(in_features=768, out_features=3072, bias=True)\n",
497 | " (linear_2): Linear(in_features=768, out_features=3072, bias=True)\n",
498 | " (activation_fn): SiLU()\n",
499 | " (out_proj): Linear(in_features=3072, out_features=768, bias=True)\n",
500 | " )\n",
501 | " )\n",
502 | " )\n",
503 | " (out_proj): Linear(in_features=768, out_features=32000, bias=True)\n",
504 | ")"
505 | ]
506 | },
507 | "execution_count": 38,
508 | "metadata": {},
509 | "output_type": "execute_result"
510 | }
511 | ],
512 | "source": [
513 | "llm"
514 | ]
515 | },
516 | {
517 | "cell_type": "code",
518 | "execution_count": null,
519 | "metadata": {
520 | "id": "3jNqAo_vYlHT"
521 | },
522 | "outputs": [],
523 | "source": [
524 | "def count_parameters(model):\n",
525 | " return sum(p.numel() for p in model.parameters() if p.requires_grad)"
526 | ]
527 | },
528 | {
529 | "cell_type": "code",
530 | "execution_count": null,
531 | "metadata": {
532 | "colab": {
533 | "base_uri": "https://localhost:8080/"
534 | },
535 | "id": "FUnuoPINY9VS",
536 | "outputId": "cd62a0e3-8f17-43cf-9f60-cb8122eb1913"
537 | },
538 | "outputs": [
539 | {
540 | "data": {
541 | "text/plain": [
542 | "257645312"
543 | ]
544 | },
545 | "execution_count": 45,
546 | "metadata": {},
547 | "output_type": "execute_result"
548 | }
549 | ],
550 | "source": [
551 | "count_parameters(llm)"
552 | ]
553 | },
554 | {
555 | "cell_type": "code",
556 | "execution_count": null,
557 | "metadata": {
558 | "id": "MJSfGx2I2Io6"
559 | },
560 | "outputs": [],
561 | "source": []
562 | }
563 | ],
564 | "metadata": {
565 | "colab": {
566 | "provenance": []
567 | },
568 | "kernelspec": {
569 | "display_name": "Python 3",
570 | "name": "python3"
571 | },
572 | "language_info": {
573 | "name": "python"
574 | }
575 | },
576 | "nbformat": 4,
577 | "nbformat_minor": 0
578 | }
579 |
--------------------------------------------------------------------------------
/Llama-2/Part 2/BabyLLaMA.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "markdown",
5 | "metadata": {},
6 | "source": [
7 | "\n",
8 | "
\n",
9 | ""
10 | ]
11 | },
12 | {
13 | "cell_type": "markdown",
14 | "metadata": {
15 | "id": "n4oOcpvm2Bt6"
16 | },
17 | "source": [
18 | " # BabyLLaMA\n",
19 | "\n",
20 | "Coding the LLaMA-2 research paper from scratch to create models with sizes 100M, 250M and 500M params."
21 | ]
22 | },
23 | {
24 | "cell_type": "markdown",
25 | "metadata": {
26 | "id": "NrmuruVN5_2F"
27 | },
28 | "source": [
29 | "## Model Arch\n",
30 | "\n",
31 | "Decoder only: Composed of identical `n_layers`. Each layer has two sub-layers. The first is a multi-head self-attention mechanism, and the second is a simple position-wise fully connected FFN. We employ residual connection around each of the sub-layers, followed by layers normalizatin. That is:\n",
32 | "LayerNorm(x + Sublayer(x))\n",
33 | " -- A Vaswani et al., 2017."
34 | ]
35 | },
36 | {
37 | "cell_type": "code",
38 | "execution_count": null,
39 | "metadata": {
40 | "id": "3jNqAo_vYlHT"
41 | },
42 | "outputs": [],
43 | "source": [
44 | "def count_parameters(model):\n",
45 | " return sum(p.numel() for p in model.parameters() if p.requires_grad)"
46 | ]
47 | },
48 | {
49 | "cell_type": "code",
50 | "execution_count": null,
51 | "metadata": {
52 | "id": "bZjKLwgz6os2"
53 | },
54 | "outputs": [],
55 | "source": [
56 | "import torch\n",
57 | "from torch import nn\n",
58 | "import torch.nn.functional as F\n",
59 | "from math import sqrt"
60 | ]
61 | },
62 | {
63 | "cell_type": "code",
64 | "execution_count": null,
65 | "metadata": {
66 | "id": "xf2Oy7eGLr8f"
67 | },
68 | "outputs": [],
69 | "source": [
70 | "n_layers = 6 # 22 Tiny LLaMA\n",
71 | "n_heads = 6 # 32 Tiny LLaMA\n",
72 | "d_model = 768 # 2048 Tiny LLaMA\n",
73 | "intermediate_dim = d_model * 4"
74 | ]
75 | },
76 | {
77 | "cell_type": "markdown",
78 | "metadata": {
79 | "id": "tmh9Ms0WKW6B"
80 | },
81 | "source": [
82 | "### MHA\n",
83 | "
"
84 | ]
85 | },
86 | {
87 | "cell_type": "code",
88 | "execution_count": null,
89 | "metadata": {
90 | "id": "yR6sUtOaUZ-r"
91 | },
92 | "outputs": [],
93 | "source": [
94 | "# Generate random input data\n",
95 | "sequence_length = 10 # number of tokens\n",
96 | "batch_size = 5\n",
97 | "input_data = torch.rand((batch_size, sequence_length, d_model)) # [bs, sequence_length, d_model]"
98 | ]
99 | },
100 | {
101 | "cell_type": "code",
102 | "execution_count": null,
103 | "metadata": {
104 | "colab": {
105 | "base_uri": "https://localhost:8080/"
106 | },
107 | "id": "Q0P9tmSWVMrz",
108 | "outputId": "debef009-d3d9-41dd-b7bc-dd1a644b15f0"
109 | },
110 | "outputs": [
111 | {
112 | "data": {
113 | "text/plain": [
114 | "torch.Size([5, 10, 768])"
115 | ]
116 | },
117 | "execution_count": 99,
118 | "metadata": {},
119 | "output_type": "execute_result"
120 | }
121 | ],
122 | "source": [
123 | "input_data.shape"
124 | ]
125 | },
126 | {
127 | "cell_type": "markdown",
128 | "metadata": {
129 | "id": "X6fJhFVGZQDG"
130 | },
131 | "source": [
132 | "- MQA\n",
133 | "- GQA"
134 | ]
135 | },
136 | {
137 | "cell_type": "code",
138 | "execution_count": null,
139 | "metadata": {
140 | "id": "L6BA3bdJJTVu"
141 | },
142 | "outputs": [],
143 | "source": [
144 | "class AttentionHead(nn.Module):\n",
145 | " def __init__(self, embed_dim, hidden_dim):\n",
146 | " super(AttentionHead, self).__init__()\n",
147 | " self.q = nn.Linear(embed_dim, hidden_dim)\n",
148 | " self.k = nn.Linear(embed_dim, hidden_dim)\n",
149 | " self.v = nn.Linear(embed_dim, hidden_dim)\n",
150 | "\n",
151 | " def scaled_dot_product_attention(self, q, k, v, mask = None):\n",
152 | " dim_k = q.size(-1)\n",
153 | " scores = torch.bmm(q, k.transpose(1, 2)) / sqrt(dim_k) # k.T = [bs, seq_len, embed_dim] -> [bs, embed_dim, seq_len]\n",
154 | " if mask is not None:\n",
155 | " scores = torch.masked_fill(scores, mask == 0, -torch.inf)\n",
156 | " weights = F.softmax(scores, dim=-1)\n",
157 | " return torch.bmm(weights, v)\n",
158 | "\n",
159 | " def forward(self, hidden_state, mask = None):\n",
160 | " output = self.scaled_dot_product_attention(\n",
161 | " self.q(hidden_state), self.k(hidden_state), self.v(hidden_state), mask=mask\n",
162 | " )\n",
163 | " return output"
164 | ]
165 | },
166 | {
167 | "cell_type": "code",
168 | "execution_count": null,
169 | "metadata": {
170 | "colab": {
171 | "base_uri": "https://localhost:8080/"
172 | },
173 | "id": "yR7l4QQIW_1S",
174 | "outputId": "7ebd7dd2-bff2-4167-d5a5-6a53c85e0d72"
175 | },
176 | "outputs": [
177 | {
178 | "data": {
179 | "text/plain": [
180 | "torch.Size([5, 10, 128])"
181 | ]
182 | },
183 | "execution_count": 8,
184 | "metadata": {},
185 | "output_type": "execute_result"
186 | }
187 | ],
188 | "source": [
189 | "attn = AttentionHead(d_model, d_model//n_heads)\n",
190 | "attn(input_data).shape"
191 | ]
192 | },
193 | {
194 | "cell_type": "code",
195 | "execution_count": null,
196 | "metadata": {
197 | "id": "xdkLltN5kMlb"
198 | },
199 | "outputs": [],
200 | "source": []
201 | },
202 | {
203 | "cell_type": "markdown",
204 | "metadata": {
205 | "id": "xzQ-tT98j4v4"
206 | },
207 | "source": [
208 | "## MHA vs GQA vs MQA\n",
209 | ""
210 | ]
211 | },
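  | {
  | "cell_type": "code",
  | "execution_count": null,
  | "metadata": {},
  | "outputs": [],
  | "source": [
  | "# A back-of-the-envelope sketch (illustrative only, not used by the classes below):\n",
  | "# MHA keeps one K/V pair per query head, GQA one per group, MQA a single shared pair,\n",
  | "# so the K/V projection weights shrink accordingly. Bias terms are ignored.\n",
  | "def qkv_weight_params(n_q_heads, n_kv_heads, hidden_dim):\n",
  | "    head_dim = hidden_dim // n_q_heads\n",
  | "    q = n_q_heads * hidden_dim * head_dim  # query projections\n",
  | "    kv = 2 * n_kv_heads * hidden_dim * head_dim  # key/value projections\n",
  | "    return q + kv\n",
  | "\n",
  | "for name, kv_heads in [(\"MHA\", 6), (\"GQA\", 2), (\"MQA\", 1)]:\n",
  | "    print(name, qkv_weight_params(6, kv_heads, 768))\n"
  | ]
  | },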
212 | {
213 | "cell_type": "code",
214 | "execution_count": null,
215 | "metadata": {
216 | "id": "9D3EVPKaYXRJ"
217 | },
218 | "outputs": [],
219 | "source": [
220 | "class MultiHeadAttention(nn.Module):\n",
221 | " def __init__(self, n_heads, hidden_dim):\n",
222 | " super(MultiHeadAttention, self).__init__()\n",
223 | " embed_dim = hidden_dim\n",
224 | " head_dim = hidden_dim // n_heads\n",
225 | " self.heads = nn.ModuleList(\n",
226 | " [AttentionHead(embed_dim, head_dim) for _ in range(n_heads)]\n",
227 | " )\n",
228 | " self.out_proj = nn.Linear(embed_dim, embed_dim)\n",
229 | "\n",
230 | " def forward(self, hidden_state):\n",
231 | " x = torch.cat([h(hidden_state) for h in self.heads], dim=-1)\n",
232 | " return self.out_proj(x)\n",
233 | "\n",
234 | "\n",
235 | "class MultiQueryAttention(nn.Module):\n",
236 | " def __init__(self, n_q_heads, hidden_dim):\n",
237 | " super(MultiQueryAttention, self).__init__()\n",
238 | " head_dim = hidden_dim // n_heads\n",
239 | " self.queries = nn.ModuleList(\n",
240 | " [nn.Linear(hidden_dim, head_dim) for _ in range(n_q_heads)]\n",
241 | " )\n",
242 | " self.key = nn.Linear(hidden_dim, head_dim)\n",
243 | " self.value = nn.Linear(hidden_dim, head_dim)\n",
244 | " self.out_proj = nn.Linear(n_q_heads * head_dim, hidden_dim)\n",
245 | "\n",
246 | " def scaled_dot_product_attention(self, q, k, v, mask = None):\n",
247 | " dim_k = q.size(-1)\n",
248 | " scores = torch.bmm(q, k.transpose(1, 2)) / sqrt(dim_k) # k.T = [bs, seq_len, embed_dim] -> [bs, embed_dim, seq_len]\n",
249 | " if mask is not None:\n",
250 | " scores = torch.masked_fill(scores, mask == 0, -torch.inf)\n",
251 | " weights = F.softmax(scores, dim=-1)\n",
252 | " return torch.bmm(weights, v)\n",
253 | "\n",
254 | " def forward(self, hidden_state):\n",
255 | " k = self.key(hidden_state)\n",
256 | " v = self.value(hidden_state)\n",
257 | " x = torch.cat([\n",
258 | " self.scaled_dot_product_attention(query(hidden_state), k, v)\n",
259 | " for query in self.queries\n",
260 | " ], dim=-1)\n",
261 | " return self.out_proj(x)\n",
262 | "\n",
263 | "\n",
264 | "class GroupedQueryAttention(nn.Module):\n",
265 | " def __init__(self, n_q_heads_per_group, n_k_v_heads, hidden_dim):\n",
266 | " super(GroupedQueryAttention, self).__init__()\n",
267 | " self.n_k_v_heads = n_k_v_heads\n",
268 | " self.n_q_heads_per_group = n_q_heads_per_group\n",
269 | " self.hidden_dim = hidden_dim\n",
270 | " self.grouped = nn.ModuleList([\n",
271 | " MultiQueryAttention(\n",
272 | " n_q_heads=n_q_heads_per_group, hidden_dim=hidden_dim\n",
273 | " )\n",
274 | " for _ in range(n_k_v_heads)\n",
275 | " ])\n",
276 | " self.proj = nn.Linear(in_features=hidden_dim * n_k_v_heads,\n",
277 | " out_features=hidden_dim, bias=False)\n",
278 | "\n",
279 | " def forward(self, hidden_state, mask=None):\n",
280 | " Z_s = torch.cat([head(hidden_state) for head in self.grouped], dim=-1)\n",
281 | " Z = self.proj(Z_s)\n",
282 | " return Z"
283 | ]
284 | },
285 | {
286 | "cell_type": "code",
287 | "execution_count": null,
288 | "metadata": {
289 | "colab": {
290 | "base_uri": "https://localhost:8080/"
291 | },
292 | "id": "DmeF6UoHe_Vy",
293 | "outputId": "059a3f41-23f0-4cc0-93e2-069b634091be"
294 | },
295 | "outputs": [
296 | {
297 | "data": {
298 | "text/plain": [
299 | "torch.Size([5, 10, 768])"
300 | ]
301 | },
302 | "execution_count": 23,
303 | "metadata": {},
304 | "output_type": "execute_result"
305 | }
306 | ],
307 | "source": [
308 | "mha = MultiHeadAttention(n_heads, d_model)\n",
309 | "mha(input_data).shape"
310 | ]
311 | },
312 | {
313 | "cell_type": "code",
314 | "execution_count": null,
315 | "metadata": {
316 | "colab": {
317 | "base_uri": "https://localhost:8080/"
318 | },
319 | "id": "kHzHnc6QuzDx",
320 | "outputId": "3777829b-8b5a-49b4-e56f-6c6c40ace32a"
321 | },
322 | "outputs": [
323 | {
324 | "data": {
325 | "text/plain": [
326 | "2362368"
327 | ]
328 | },
329 | "execution_count": 86,
330 | "metadata": {},
331 | "output_type": "execute_result"
332 | }
333 | ],
334 | "source": [
335 | "count_parameters(mha)"
336 | ]
337 | },
338 | {
339 | "cell_type": "code",
340 | "execution_count": null,
341 | "metadata": {
342 | "colab": {
343 | "base_uri": "https://localhost:8080/"
344 | },
345 | "id": "UxWS1kkIkI4C",
346 | "outputId": "64757630-b346-4975-8cd6-0a1b9e346a00"
347 | },
348 | "outputs": [
349 | {
350 | "data": {
351 | "text/plain": [
352 | "torch.Size([5, 10, 768])"
353 | ]
354 | },
355 | "execution_count": 85,
356 | "metadata": {},
357 | "output_type": "execute_result"
358 | }
359 | ],
360 | "source": [
361 | "mqa = MultiQueryAttention(n_heads, d_model)\n",
362 | "mqa(input_data).shape"
363 | ]
364 | },
365 | {
366 | "cell_type": "code",
367 | "execution_count": null,
368 | "metadata": {
369 | "colab": {
370 | "base_uri": "https://localhost:8080/"
371 | },
372 | "id": "jnqN13SHOMQF",
373 | "outputId": "86d62f49-c22a-4d75-edb5-abd4ff8839c5"
374 | },
375 | "outputs": [
376 | {
377 | "data": {
378 | "text/plain": [
379 | "1378048"
380 | ]
381 | },
382 | "execution_count": 84,
383 | "metadata": {},
384 | "output_type": "execute_result"
385 | }
386 | ],
387 | "source": [
388 | "count_parameters(mqa)"
389 | ]
390 | },
391 | {
392 | "cell_type": "code",
393 | "execution_count": null,
394 | "metadata": {
395 | "colab": {
396 | "base_uri": "https://localhost:8080/"
397 | },
398 | "id": "awEPrfZumBhd",
399 | "outputId": "90d9cba3-fe1c-46c1-b28e-756c7292f1b2"
400 | },
401 | "outputs": [
402 | {
403 | "data": {
404 | "text/plain": [
405 | "torch.Size([5, 10, 768])"
406 | ]
407 | },
408 | "execution_count": 103,
409 | "metadata": {},
410 | "output_type": "execute_result"
411 | }
412 | ],
413 | "source": [
414 | "num_q_heads = n_heads\n",
415 | "n_k_v_heads=2\n",
416 | "n_q_heads_per_group = num_q_heads // n_k_v_heads\n",
417 | "\n",
418 | "gqa = GroupedQueryAttention(n_q_heads_per_group=n_q_heads_per_group, n_k_v_heads=n_k_v_heads, hidden_dim=d_model)\n",
419 | "gqa(input_data).shape"
420 | ]
421 | },
422 | {
423 | "cell_type": "code",
424 | "execution_count": null,
425 | "metadata": {
426 | "colab": {
427 | "base_uri": "https://localhost:8080/"
428 | },
429 | "id": "7YVIJen6urJW",
430 | "outputId": "5ea7e561-8deb-4757-823c-00785dcfd9ba"
431 | },
432 | "outputs": [
433 | {
434 | "data": {
435 | "text/plain": [
436 | "2755328"
437 | ]
438 | },
439 | "execution_count": 104,
440 | "metadata": {},
441 | "output_type": "execute_result"
442 | }
443 | ],
444 | "source": [
445 | "count_parameters(gqa)"
446 | ]
447 | },
448 | {
449 | "cell_type": "code",
450 | "execution_count": null,
451 | "metadata": {
452 | "colab": {
453 | "base_uri": "https://localhost:8080/"
454 | },
455 | "id": "xj0XTxleorjE",
456 | "outputId": "f7e134d7-513b-4e50-f036-df84ef878274"
457 | },
458 | "outputs": [
459 | {
460 | "data": {
461 | "text/plain": [
462 | "MHA(\n",
463 | " (heads): ModuleList(\n",
464 | " (0-5): 6 x AttentionHead(\n",
465 | " (q): Linear(in_features=768, out_features=128, bias=True)\n",
466 | " (k): Linear(in_features=768, out_features=128, bias=True)\n",
467 | " (v): Linear(in_features=768, out_features=128, bias=True)\n",
468 | " )\n",
469 | " )\n",
470 | " (out_proj): Linear(in_features=768, out_features=768, bias=True)\n",
471 | ")"
472 | ]
473 | },
474 | "execution_count": 12,
475 | "metadata": {},
476 | "output_type": "execute_result"
477 | }
478 | ],
479 | "source": [
480 | "mha"
481 | ]
482 | },
483 | {
484 | "cell_type": "code",
485 | "execution_count": null,
486 | "metadata": {
487 | "id": "rQfC90HwjtgN"
488 | },
489 | "outputs": [],
490 | "source": [
491 | "class LLaMAMLP(nn.Module):\n",
492 | " def __init__(self, hidden_dim, intermediate_dim): # in MLP: intermediate_dim= 4 * hidden_dim\n",
493 | " super(LLaMAMLP, self).__init__()\n",
494 | " self.linear_1 = nn.Linear(hidden_dim, intermediate_dim)\n",
495 | " self.linear_2 = nn.Linear(hidden_dim, intermediate_dim) # Original: intermediate -> hidden.\n",
496 | " self.activation_fn = nn.SiLU()\n",
497 | " self.out_proj = nn.Linear(intermediate_dim, hidden_dim) # Original: dropout\n",
498 | "\n",
499 | "\n",
500 | " def forward(self, hidden_state):\n",
501 | " x_fc_1 = self.linear_1(hidden_state)\n",
502 | " x_fc_2 = self.linear_2(hidden_state)\n",
503 | " x = self.activation_fn(x_fc_1) * x_fc_2\n",
504 | " return self.out_proj(x)"
505 | ]
506 | },
507 | {
508 | "cell_type": "code",
509 | "execution_count": null,
510 | "metadata": {
511 | "colab": {
512 | "base_uri": "https://localhost:8080/"
513 | },
514 | "id": "HifM3AeAq44l",
515 | "outputId": "f562f104-fe04-4045-d96f-bab97c8fb4f9"
516 | },
517 | "outputs": [
518 | {
519 | "data": {
520 | "text/plain": [
521 | "torch.Size([5, 10, 768])"
522 | ]
523 | },
524 | "execution_count": 51,
525 | "metadata": {},
526 | "output_type": "execute_result"
527 | }
528 | ],
529 | "source": [
530 | "mlp = LLaMAMLP(d_model, intermediate_dim)\n",
531 | "mlp(input_data).shape"
532 | ]
533 | },
534 | {
535 | "cell_type": "code",
536 | "execution_count": null,
537 | "metadata": {
538 | "id": "DDnXHH6bP4a3"
539 | },
540 | "outputs": [],
541 | "source": [
542 | "class Block(nn.Module):\n",
543 | " def __init__(self, n_heads, n_k_v_heads, hidden_dim, intermediate_dim):\n",
544 | " super(Block, self).__init__()\n",
545 | " self.n_heads = n_heads\n",
546 | " self.hidden_dim = hidden_dim\n",
547 | " self.intermediate_dim = intermediate_dim\n",
548 | "\n",
549 | " # Self-Attention (MHA, MQA & GQA)\n",
550 | " if n_heads == n_k_v_heads:\n",
551 | " # MHA selected\n",
552 | " self.attn = MultiHeadAttention(n_heads, hidden_dim=hidden_dim)\n",
553 | " elif n_k_v_heads == 1 :\n",
554 | " # MQA selected\n",
555 | " self.attn = MultiQueryAttention(n_heads, hidden_dim=hidden_dim)\n",
556 | " elif n_heads // n_k_v_heads > 1:\n",
557 | " # GQA selected\n",
558 | " self.attn = GroupedQueryAttention(n_heads // n_k_v_heads, n_k_v_heads, hidden_dim=hidden_dim)\n",
559 | " else:\n",
560 | " # MHA selected\n",
561 | " self.attn = MultiHeadAttention(n_heads, hidden_dim=hidden_dim)\n",
562 | "\n",
563 | " self.layer_norm = nn.LayerNorm(hidden_dim)\n",
564 | " self.mlp = LLaMAMLP(hidden_dim, intermediate_dim)\n",
565 | "\n",
566 | " def forward(self, hidden_state, mask=None):\n",
567 | " x = self.attn(hidden_state)\n",
568 | " x = self.layer_norm(hidden_state) + x\n",
569 | " x_fc = self.mlp(x)\n",
570 | " x += x_fc\n",
571 | " return x\n"
572 | ]
573 | },
574 | {
575 | "cell_type": "code",
576 | "execution_count": null,
577 | "metadata": {
578 | "colab": {
579 | "base_uri": "https://localhost:8080/"
580 | },
581 | "id": "S5UplSTmSDpt",
582 | "outputId": "cb272aad-0958-4267-9266-f0a5ee4eb557"
583 | },
584 | "outputs": [
585 | {
586 | "name": "stdout",
587 | "output_type": "stream",
588 | "text": [
589 | "GQA selected\n"
590 | ]
591 | },
592 | {
593 | "data": {
594 | "text/plain": [
595 | "torch.Size([5, 10, 768])"
596 | ]
597 | },
598 | "execution_count": 112,
599 | "metadata": {},
600 | "output_type": "execute_result"
601 | }
602 | ],
603 | "source": [
604 | "block = Block(n_heads, n_k_v_heads, d_model, intermediate_dim)\n",
605 | "block(input_data).shape"
606 | ]
607 | },
608 | {
609 | "cell_type": "code",
610 | "execution_count": null,
611 | "metadata": {
612 | "id": "GcngjpHCSUnh"
613 | },
614 | "outputs": [],
615 | "source": [
616 | "class babyLLaMA(nn.Module):\n",
617 | " def __init__(self, max_seq_len, vocab_size, n_layers, n_heads, n_k_v_heads, hidden_dim, intermediate_dim):\n",
618 | " super(babyLLaMA, self).__init__()\n",
619 | " self.emb = nn.Embedding(vocab_size, hidden_dim)\n",
620 | " self.pos = nn.Embedding(max_seq_len, hidden_dim)\n",
621 | " self.blocks = nn.ModuleList(\n",
622 | " [Block(n_heads, n_k_v_heads, hidden_dim, intermediate_dim) for _ in range(n_layers)]\n",
623 | " )\n",
624 | " self.out_proj = nn.Linear(hidden_dim, vocab_size)\n",
625 | "\n",
626 | " def forward(self, hidden_state):\n",
627 | " emb = self.emb(hidden_state)\n",
628 | " seq_len = hidden_state.size(1)\n",
629 | " positions = torch.arange(seq_len, dtype=torch.long).unsqueeze(0)\n",
630 | " pos = self.pos(positions)\n",
631 | " x = emb + pos\n",
632 | "\n",
633 | " for b in self.blocks:\n",
634 | " x = b(x)\n",
635 | "\n",
636 | " x = self.out_proj(x)\n",
637 | " return F.softmax(x, dim=-1)\n"
638 | ]
639 | },
640 | {
641 | "cell_type": "code",
642 | "execution_count": null,
643 | "metadata": {
644 | "id": "HZqvBWtmVxUk"
645 | },
646 | "outputs": [],
647 | "source": [
648 | "llm = babyLLaMA(d_model, 32000, 22, n_heads, n_k_v_heads, d_model, intermediate_dim)\n",
649 | "input_ids = torch.randint(1, 32000, (batch_size, sequence_length))"
650 | ]
651 | },
652 | {
653 | "cell_type": "code",
654 | "execution_count": null,
655 | "metadata": {
656 | "colab": {
657 | "base_uri": "https://localhost:8080/"
658 | },
659 | "id": "Lk9xlnKeYNuY",
660 | "outputId": "c380de0c-0737-4537-a586-22a2d757a1d0"
661 | },
662 | "outputs": [
663 | {
664 | "data": {
665 | "text/plain": [
666 | "torch.Size([5, 10, 32000])"
667 | ]
668 | },
669 | "execution_count": 69,
670 | "metadata": {},
671 | "output_type": "execute_result"
672 | }
673 | ],
674 | "source": [
675 | "llm(input_ids).shape"
676 | ]
677 | },
678 | {
679 | "cell_type": "code",
680 | "execution_count": null,
681 | "metadata": {
682 | "colab": {
683 | "base_uri": "https://localhost:8080/"
684 | },
685 | "id": "geZxTDtrYica",
686 | "outputId": "13f3cc25-5ed2-453c-a787-c6e5e8a3d3f8"
687 | },
688 | "outputs": [
689 | {
690 | "data": {
691 | "text/plain": [
692 | "babyLLaMA(\n",
693 | " (emb): Embedding(32000, 768)\n",
694 | " (pos): Embedding(768, 768)\n",
695 | " (blocks): ModuleList(\n",
696 | " (0-21): 22 x Block(\n",
697 | " (mqa): MultiQueryAttention(\n",
698 | " (queries): ModuleList(\n",
699 | " (0-5): 6 x Linear(in_features=768, out_features=128, bias=True)\n",
700 | " )\n",
701 | " (key): Linear(in_features=768, out_features=128, bias=True)\n",
702 | " (value): Linear(in_features=768, out_features=128, bias=True)\n",
703 | " (out_proj): Linear(in_features=768, out_features=768, bias=True)\n",
704 | " )\n",
705 | " (layer_norm): LayerNorm((768,), eps=1e-05, elementwise_affine=True)\n",
706 | " (mlp): LLaMAMLP(\n",
707 | " (linear_1): Linear(in_features=768, out_features=3072, bias=True)\n",
708 | " (linear_2): Linear(in_features=768, out_features=3072, bias=True)\n",
709 | " (activation_fn): SiLU()\n",
710 | " (out_proj): Linear(in_features=3072, out_features=768, bias=True)\n",
711 | " )\n",
712 | " )\n",
713 | " )\n",
714 | " (out_proj): Linear(in_features=768, out_features=32000, bias=True)\n",
715 | ")"
716 | ]
717 | },
718 | "execution_count": 75,
719 | "metadata": {},
720 | "output_type": "execute_result"
721 | }
722 | ],
723 | "source": [
724 | "llm"
725 | ]
726 | },
727 | {
728 | "cell_type": "code",
729 | "execution_count": null,
730 | "metadata": {
731 | "colab": {
732 | "base_uri": "https://localhost:8080/"
733 | },
734 | "id": "FUnuoPINY9VS",
735 | "outputId": "09a464ce-0c95-4bdf-a3e3-561e51fa27a6"
736 | },
737 | "outputs": [
738 | {
739 | "data": {
740 | "text/plain": [
741 | "266290432"
742 | ]
743 | },
744 | "execution_count": 127,
745 | "metadata": {},
746 | "output_type": "execute_result"
747 | }
748 | ],
749 | "source": [
750 | "count_parameters(llm)"
751 | ]
752 | }
753 | ],
754 | "metadata": {
755 | "colab": {
756 | "provenance": []
757 | },
758 | "kernelspec": {
759 | "display_name": "Python 3",
760 | "name": "python3"
761 | },
762 | "language_info": {
763 | "name": "python"
764 | }
765 | },
766 | "nbformat": 4,
767 | "nbformat_minor": 0
768 | }
769 |
--------------------------------------------------------------------------------
/Llama-3/Part 1/Downcycling.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": null,
6 | "metadata": {},
7 | "outputs": [],
8 | "source": [
9 | "! pip install transformers torch accelerate huggingface-hub huggingface-cli hf-transfer"
10 | ]
11 | },
12 | {
13 | "cell_type": "code",
14 | "execution_count": null,
15 | "metadata": {},
16 | "outputs": [],
17 | "source": [
18 | "def count_parameters(model):\n",
19 | " # Calculate the number of parameters in billions\n",
20 | " num_params = sum(p.numel() for p in model.parameters() if p.requires_grad) / 10**9\n",
21 | " print(f\"Model size: {num_params:.3f}B parameters\")\n",
22 | " return int(num_params)\n"
23 | ]
24 | },
25 | {
26 | "cell_type": "markdown",
27 | "metadata": {},
28 | "source": [
29 | "## Load Reference Model"
30 | ]
31 | },
32 | {
33 | "cell_type": "code",
34 | "execution_count": null,
35 | "metadata": {},
36 | "outputs": [],
37 | "source": [
38 | "from transformers import AutoModelForCausalLM, AutoConfig, AutoTokenizer\n",
39 | "import os\n",
40 | "\n",
41 | "os.environ[\"HF_HUB_ENABLE_HF_TRANSFER\"] = \"1\"\n",
42 | "\n",
43 | "# Load meta-llama/Meta-Llama-3-8B model, config and tokenizer\n",
44 | "model_name = \"meta-llama/Meta-Llama-3-8B\"\n",
45 | "model = AutoModelForCausalLM.from_pretrained(model_name)\n",
46 | "config = AutoConfig.from_pretrained(model_name)\n",
47 | "tokenizer = AutoTokenizer.from_pretrained(model_name)"
48 | ]
49 | },
50 | {
51 | "cell_type": "code",
52 | "execution_count": null,
53 | "metadata": {},
54 | "outputs": [],
55 | "source": [
56 | "count_parameters(model)"
57 | ]
58 | },
59 | {
60 | "cell_type": "code",
61 | "execution_count": null,
62 | "metadata": {},
63 | "outputs": [],
64 | "source": [
65 | "model"
66 | ]
67 | },
68 | {
69 | "cell_type": "code",
70 | "execution_count": null,
71 | "metadata": {},
72 | "outputs": [],
73 | "source": [
74 | "def extract_model_weights(reference_model, n_layers):\n",
75 | " params = {}\n",
76 | " current_layer = 0 # To keep track of the main layer count\n",
77 | "\n",
78 | " # Iterate over all named modules\n",
79 | " for name, module in reference_model.named_modules():\n",
80 | "\n",
81 | " # Check and store parameters\n",
82 | " if hasattr(module, 'weight') and module.weight is not None:\n",
83 | " params[name + '.weight'] = module.weight.data.clone()\n",
84 | " if hasattr(module, 'bias') and module.bias is not None:\n",
85 | " params[name + '.bias'] = module.bias.data.clone()\n",
86 | "\n",
87 | " if 'model.layers.' in name:\n",
88 | " # Check the layer index\n",
89 | " layer_index = int(name.split('.')[2]) # This splits the name and gets the third element\n",
90 | " if layer_index > current_layer:\n",
91 | " current_layer = layer_index\n",
92 | " if current_layer > n_layers-1:\n",
93 | " break # Stop after reaching the specified main layer\n",
94 | "\n",
95 | " norm_layer = model.model.norm # Adjust this path based on your model's architecture\n",
96 | " if hasattr(norm_layer, 'weight') and norm_layer.weight is not None:\n",
97 | " params['model.norm.weight'] = norm_layer.weight.data.clone()\n",
98 | " if hasattr(norm_layer, 'bias') and norm_layer.bias is not None:\n",
99 | " params['model.norm.bias'] = norm_layer.bias.data.clone()\n",
100 | "\n",
101 | " lm_head = reference_model.lm_head\n",
102 | " if hasattr(lm_head, 'weight') and lm_head.weight is not None:\n",
103 | " params[\"lm_head.weight\"] = lm_head.weight.data\n",
104 | " if hasattr(lm_head, 'bias') and lm_head.bias is not None:\n",
105 | " params[\"lm_head.bias\"] = lm_head.bias.data\n",
106 | "\n",
107 | " return params\n"
108 | ]
109 | },
110 | {
111 | "cell_type": "code",
112 | "execution_count": null,
113 | "metadata": {},
114 | "outputs": [],
115 | "source": [
116 | "target_model_n_layers = 24\n",
117 | "pretrained_weights = extract_model_weights(model, target_model_n_layers)"
118 | ]
119 | },
120 | {
121 | "cell_type": "code",
122 | "execution_count": null,
123 | "metadata": {},
124 | "outputs": [],
125 | "source": [
126 | "from transformers import AutoModelForCausalLM, AutoConfig\n",
127 | "config = AutoConfig.from_pretrained(model_name)\n",
128 | "config.num_hidden_layers = target_model_n_layers\n",
129 | "target_model = AutoModelForCausalLM.from_config(config)\n"
130 | ]
131 | },
132 | {
133 | "cell_type": "code",
134 | "execution_count": null,
135 | "metadata": {},
136 | "outputs": [],
137 | "source": [
138 | "target_model_size = count_parameters(target_model)"
139 | ]
140 | },
141 | {
142 | "cell_type": "code",
143 | "execution_count": null,
144 | "metadata": {},
145 | "outputs": [],
146 | "source": [
147 | "target_model.load_state_dict(pretrained_weights)\n"
148 | ]
149 | },
150 | {
151 | "cell_type": "code",
152 | "execution_count": null,
153 | "metadata": {},
154 | "outputs": [],
155 | "source": [
156 | "inputs = tokenizer(\n",
157 | "[\n",
158 | " \"Who created Python?\"\n",
159 | "], return_tensors = \"pt\")\n",
160 | "\n",
161 | "# inputs = tokenizer.apply_chat_template(\n",
162 | "# [\n",
163 | "# # {\"content\":\"\",\"role\":\"system\"},\n",
164 | "# {\"content\":\"\"\"Given the question: Read the article and select the best\n",
165 | "# answer. Article: Can you swim? Do you like swimming? Well, how can you\n",
166 | "# learn to swim? I think the best way is to go into the water and learn.\n",
167 | "# I'm afraid you'll never learn to swim just by reading books about\n",
168 | "# Swimming or looking at others swimming. It's the same with the English\n",
169 | "# study. We must practice, practice and practice. Listening and speaking\n",
170 | "# are very important for beginners. We can listen to English programs on radio.\n",
171 | "# You may just understand a few words. It doesn't matter. Just be relaxed,\n",
172 | "# try to catch every word. Somebody may be a good listener, but he is afraid\n",
173 | "# to speak because he's afraid of making mistakes. You know we sometimes\n",
174 | "# make mistakes when we speak Chinese. Don't be afraid. We must be brave.\n",
175 | "# If you really want to learn English well, you must try to speak with\n",
176 | "# everyone as long as he knows English. When there's nobody to talk with,\n",
177 | "# you can talk to yourself in English. It's interesting and also a good\n",
178 | "# way to practice your spoken English. Remember, the more you speak, the\n",
179 | "# fewer mistakes you'll make. Reading and writing are more important for\n",
180 | "# senior school students. First we must choose the books we're interested\n",
181 | "# in. A lot of reading will improve your language sense.\n",
182 | "# This is very important. It's easier said than done. Well, let's do\n",
183 | "# more practice from now on. I'm sure you'll learn English well in this\n",
184 | "# way. ,A, B, C, D,. (10)\n",
185 | "# Question: Which is the best title for the passage?\n",
186 | "# Options:\n",
187 | "# A: How to Learn English.\n",
188 | "# B: Easier Said Than Done.\n",
189 | "# C: Listen First, Speak Second.\n",
190 | "# D: How to learn to Swim.\\n\n",
191 | "# The answer is:\"\"\",\"role\":\"user\"}\n",
192 | "# ], add_generation_prompt=True, return_tensors='pt',\n",
193 | "# )"
194 | ]
195 | },
196 | {
197 | "cell_type": "code",
198 | "execution_count": null,
199 | "metadata": {},
200 | "outputs": [],
201 | "source": [
202 | "from transformers import TextStreamer\n",
203 | "text_streamer = TextStreamer(tokenizer)\n",
204 | "_ = target_model.generate(**inputs, streamer = text_streamer, max_new_tokens = 200)"
205 | ]
206 | },
207 | {
208 | "cell_type": "code",
209 | "execution_count": null,
210 | "metadata": {},
211 | "outputs": [],
212 | "source": [
213 | "target_model.push_to_hub(\"Llama-3-6B-Instruct-v0.1\")\n",
214 | "tokenizer.push_to_hub(\"Llama-3-6B-Instruct-v0.1\")"
215 | ]
216 | },
217 | {
218 | "cell_type": "markdown",
219 | "metadata": {},
220 | "source": [
221 | "# Downcycling by getting the first X layers and last X layers \n",
222 | "\n",
223 | "Where X is a N/2.\n",
224 | "\n",
225 | "For instance, if our target number layers is 24 then X will be 24/2 = 12."
226 | ]
227 | },
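  | {
  | "cell_type": "code",
  | "execution_count": null,
  | "metadata": {},
  | "outputs": [],
  | "source": [
  | "# A small pure-Python sketch of the layer selection: for Llama-3-8B's 32 layers\n",
  | "# and a 24-layer target, we keep source layers 0..11 as-is and renumber source\n",
  | "# layers 20..31 to target indices 12..23 (update_layer_numbers below does this).\n",
  | "n_source, n_target = 32, 24\n",
  | "x = n_target // 2\n",
  | "mapping = {i: i for i in range(x)}  # first X layers keep their index\n",
  | "mapping.update({n_source - x + i: x + i for i in range(x)})  # last X layers shift down\n",
  | "print(mapping)  # source layer index -> target layer index\n"
  | ]
  | },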
228 | {
229 | "cell_type": "code",
230 | "execution_count": null,
231 | "metadata": {},
232 | "outputs": [],
233 | "source": [
234 | "target_model_n_layers = 24\n",
235 | "weights_1 = model.model.layers[:target_model_n_layers//2]\n",
236 | "weights_2 = model.model.layers[-target_model_n_layers//2:]\n",
237 | "\n",
238 | "# Assuming 'model' is your pre-existing large model\n",
239 | "# This part is conceptual, assuming the model is split into exactly 24 layers evenly.\n",
240 | "\n",
241 | "# Extract weights for the first 12 layers\n",
242 | "weights_1 = {f'model.layers.{k}': v.clone() for k, v in weights_1.state_dict().items() }\n",
243 | "\n",
244 | "# Extract weights for the last 12 layers\n",
245 | "weights_2 = {f'model.layers.{k}': v.clone() for k, v in weights_2.state_dict().items()}\n"
246 | ]
247 | },
248 | {
249 | "cell_type": "code",
250 | "execution_count": null,
251 | "metadata": {},
252 | "outputs": [],
253 | "source": [
254 | "# Get remainder modules weights\n",
255 | "weights_1[\"model.embed_tokens.weight\"] = model.model.state_dict()['embed_tokens.weight']\n",
256 | "weights_2[\"model.norm.weight\"] = model.model.state_dict()['norm.weight']\n",
257 | "weights_2[\"lm_head.weight\"] = model.state_dict()['lm_head.weight']"
258 | ]
259 | },
260 | {
261 | "cell_type": "code",
262 | "execution_count": null,
263 | "metadata": {},
264 | "outputs": [],
265 | "source": [
266 | "import re\n",
267 | "def update_layer_numbers(state_dict, x_size):\n",
268 | " new_state_dict = {}\n",
269 | " # Regular expression to find and manipulate the layer numbers\n",
270 | " pattern = re.compile(r'model.layers.(\\d+)')\n",
271 | "\n",
272 | " for key, value in state_dict.items():\n",
273 | " # Search for the pattern and update\n",
274 | " new_key = pattern.sub(lambda x: f\"model.layers.{int(x.group(1)) + x_size}\", key)\n",
275 | " new_state_dict[new_key] = value\n",
276 | "\n",
277 | " return new_state_dict\n",
278 | "\n",
279 | "\n",
280 | "weights_2 = update_layer_numbers(weights_2, target_model_n_layers//2)"
281 | ]
282 | },
283 | {
284 | "cell_type": "code",
285 | "execution_count": null,
286 | "metadata": {},
287 | "outputs": [],
288 | "source": [
289 | "from transformers import AutoModelForCausalLM, AutoConfig\n",
290 | "config = AutoConfig.from_pretrained(model_name)\n",
291 | "config.num_hidden_layers = target_model_n_layers\n",
292 | "target_model = AutoModelForCausalLM.from_config(config)"
293 | ]
294 | },
295 | {
296 | "cell_type": "code",
297 | "execution_count": null,
298 | "metadata": {},
299 | "outputs": [],
300 | "source": [
301 | "target_model.load_state_dict({**weights_1, **weights_2})"
302 | ]
303 | },
304 | {
305 | "cell_type": "code",
306 | "execution_count": null,
307 | "metadata": {},
308 | "outputs": [],
309 | "source": [
310 | "count_parameters(target_model)"
311 | ]
312 | },
313 | {
314 | "cell_type": "code",
315 | "execution_count": null,
316 | "metadata": {},
317 | "outputs": [],
318 | "source": [
319 | "inputs = tokenizer.apply_chat_template(\n",
320 | " [\n",
321 | " # {\"content\":\"\",\"role\":\"system\"},\n",
322 | " {\"content\":\"\"\"Given the question: Read the article and select the best\n",
323 | " answer. Article: Can you swim? Do you like swimming? Well, how can you\n",
324 | " learn to swim? I think the best way is to go into the water and learn.\n",
325 | " I'm afraid you'll never learn to swim just by reading books about\n",
326 | " Swimming or looking at others swimming. It's the same with the English\n",
327 | " study. We must practice, practice and practice. Listening and speaking\n",
328 | " are very important for beginners. We can listen to English programs on radio.\n",
329 | " You may just understand a few words. It doesn't matter. Just be relaxed,\n",
330 | " try to catch every word. Somebody may be a good listener, but he is afraid\n",
331 | " to speak because he's afraid of making mistakes. You know we sometimes\n",
332 | " make mistakes when we speak Chinese. Don't be afraid. We must be brave.\n",
333 | " If you really want to learn English well, you must try to speak with\n",
334 | " everyone as long as he knows English. When there's nobody to talk with,\n",
335 | " you can talk to yourself in English. It's interesting and also a good\n",
336 | " way to practice your spoken English. Remember, the more you speak, the\n",
337 | " fewer mistakes you'll make. Reading and writing are more important for\n",
338 | " senior school students. First we must choose the books we're interested\n",
339 | " in. A lot of reading will improve your language sense.\n",
340 | " This is very important. It's easier said than done. Well, let's do\n",
341 | " more practice from now on. I'm sure you'll learn English well in this\n",
342 | " way. ,A, B, C, D,. (10)\n",
343 | " Question: Which is the best title for the passage?\n",
344 | " Options:\n",
345 | " A: How to Learn English.\n",
346 | " B: Easier Said Than Done.\n",
347 | " C: Listen First, Speak Second.\n",
348 | " D: How to learn to Swim.\\n\n",
349 | " The answer is:\"\"\",\"role\":\"user\"}\n",
350 | " ], add_generation_prompt=True, return_tensors='pt',\n",
351 | ")"
352 | ]
353 | },
354 | {
355 | "cell_type": "code",
356 | "execution_count": null,
357 | "metadata": {},
358 | "outputs": [],
359 | "source": [
360 | "from transformers import TextStreamer\n",
361 | "text_streamer = TextStreamer(tokenizer)\n",
362 | "_ = target_model.generate(inputs, streamer = text_streamer, max_new_tokens = 128)"
363 | ]
364 | },
365 | {
366 | "cell_type": "code",
367 | "execution_count": null,
368 | "metadata": {},
369 | "outputs": [],
370 | "source": [
371 | "target_model.push_to_hub(\"Llama-3-6B-Instruct-Granite-v0.1\")\n",
372 | "tokenizer.push_to_hub(\"Llama-3-6B-Instruct-Granite-v0.1\")"
373 | ]
374 | }
375 | ],
376 | "metadata": {
377 | "kernelspec": {
378 | "display_name": "mlx_code",
379 | "language": "python",
380 | "name": "python3"
381 | },
382 | "language_info": {
383 | "codemirror_mode": {
384 | "name": "ipython",
385 | "version": 3
386 | },
387 | "file_extension": ".py",
388 | "mimetype": "text/x-python",
389 | "name": "python",
390 | "nbconvert_exporter": "python",
391 | "pygments_lexer": "ipython3",
392 | "version": "3.10.14"
393 | }
394 | },
395 | "nbformat": 4,
396 | "nbformat_minor": 2
397 | }
398 |
--------------------------------------------------------------------------------
/Llama-3/Part 2/Config/train-llama-3-6B.yml:
--------------------------------------------------------------------------------
1 | base_model: prince-canuma/Llama-3-6B-v0
2 | model_type: AutoModelForCausalLM
3 | tokenizer_type: AutoTokenizer
4 |
5 | load_in_8bit: false
6 | load_in_4bit: true
7 | strict: false
8 |
9 | datasets:
10 | - path: prince-canuma/fineweb-CC-MAIN-2024-10-1B-en
11 | type: completion
12 | split: train
13 | dataset_prepared_path: last_run_prepared
14 | val_set_size: 0.001
15 | output_dir: ./llama-3-6b
16 | save_safetensors: true
17 | adapter: qlora
18 | lora_model_dir:
19 |
20 | sequence_len: 8192
21 | sample_packing: false
22 | pad_to_sequence_len: false
23 |
24 | lora_r: 128
25 | lora_alpha: 128
26 | lora_dropout: 0.05
27 | lora_target_modules:
28 | lora_target_linear: true
29 | lora_fan_in_fan_out:
30 |
31 |
32 | wandb_project: llama-3-6b
33 | wandb_entity:
34 | wandb_watch:
35 | wandb_name:
36 | wandb_log_model:
37 |
38 | gradient_accumulation_steps: 8
39 | micro_batch_size: 2
40 | num_epochs: 2
41 | optimizer: paged_adamw_32bit
42 | lr_scheduler: cosine
43 | learning_rate: 2e-4
44 |
45 | train_on_inputs: false
46 | group_by_length: false
47 | bf16: auto
48 | fp16:
49 | tf32: false
50 |
51 | gradient_checkpointing: true
52 | early_stopping_patience:
53 | resume_from_checkpoint:
54 | local_rank:
55 | logging_steps: 1
56 | xformers_attention:
57 | flash_attention: true
58 |
59 | warmup_steps: 100
60 | evals_per_epoch: 4
61 | eval_table_size:
62 | save_steps: 4000
63 | debug:
64 | deepspeed:
65 | weight_decay: 0.0
66 | fsdp:
67 | fsdp_config:
68 | special_tokens:
69 | pad_token: "<|reserved_special_token_0|>"
70 |
71 |
--------------------------------------------------------------------------------
/Llama-3/Part 2/Downcycling_Comparision.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": null,
6 | "metadata": {},
7 | "outputs": [],
8 | "source": [
9 | "! pip install transformers torch accelerate huggingface-hub huggingface-cli hf-transfer"
10 | ]
11 | },
12 | {
13 | "cell_type": "code",
14 | "execution_count": null,
15 | "metadata": {},
16 | "outputs": [],
17 | "source": [
18 | "from transformers import TextStreamer\n",
19 | "\n",
20 | "def count_parameters(model):\n",
21 | " # Calculate the number of parameters in billions\n",
22 | " num_params = sum(p.numel() for p in model.parameters() if p.requires_grad) / 10**9\n",
23 | " print(f\"Model size: {num_params:.3f}B parameters\")\n",
24 | " return int(num_params)\n",
25 | "\n",
26 | "def generate(model, tokenizer, inputs, max_new_tokens=50):\n",
27 | " text_streamer = TextStreamer(tokenizer)\n",
28 | " _ = model.generate(**inputs, streamer = text_streamer, max_new_tokens = max_new_tokens)"
29 | ]
30 | },
31 | {
32 | "cell_type": "markdown",
33 | "metadata": {},
34 | "source": [
35 | "
"
36 | ]
37 | },
38 | {
39 | "cell_type": "markdown",
40 | "metadata": {},
41 | "source": [
42 | "## Load Untrained Downcycled Model"
43 | ]
44 | },
45 | {
46 | "cell_type": "code",
47 | "execution_count": null,
48 | "metadata": {},
49 | "outputs": [],
50 | "source": [
51 | "from transformers import AutoModelForCausalLM, AutoTokenizer\n",
52 | "import os\n",
53 | "\n",
54 | "os.environ[\"HF_HUB_ENABLE_HF_TRANSFER\"] = \"1\"\n",
55 | "\n",
56 | "# Load model, config and tokenizer\n",
57 | "model_name = \"prince-canuma/Llama-3-6B-v0\"\n",
58 | "untrained_model = AutoModelForCausalLM.from_pretrained(model_name)\n",
59 | "tokenizer = AutoTokenizer.from_pretrained(model_name)"
60 | ]
61 | },
62 | {
63 | "cell_type": "markdown",
64 | "metadata": {},
65 | "source": [
66 | "## Load Pretrained Downcycled Model"
67 | ]
68 | },
69 | {
70 | "cell_type": "code",
71 | "execution_count": null,
72 | "metadata": {},
73 | "outputs": [],
74 | "source": [
75 | "from transformers import AutoModelForCausalLM, AutoTokenizer\n",
76 | "\n",
77 | "\n",
78 | "# Load model, config and tokenizer\n",
79 | "model_name = \"prince-canuma/Llama-3-6B-v0.1\"\n",
80 | "model = AutoModelForCausalLM.from_pretrained(model_name)\n",
81 | "tokenizer = AutoTokenizer.from_pretrained(model_name)"
82 | ]
83 | },
84 | {
85 | "cell_type": "code",
86 | "execution_count": null,
87 | "metadata": {},
88 | "outputs": [],
89 | "source": [
90 | "count_parameters(model)"
91 | ]
92 | },
93 | {
94 | "cell_type": "code",
95 | "execution_count": null,
96 | "metadata": {},
97 | "outputs": [],
98 | "source": [
99 | "inputs = tokenizer(\n",
100 | "[\n",
101 | " \"The Eifel tower is located in\"\n",
102 | "], return_tensors = \"pt\")\n"
103 | ]
104 | },
105 | {
106 | "cell_type": "code",
107 | "execution_count": null,
108 | "metadata": {},
109 | "outputs": [],
110 | "source": [
111 | "generate(untrained_model, tokenizer, inputs)"
112 | ]
113 | },
114 | {
115 | "cell_type": "code",
116 | "execution_count": null,
117 | "metadata": {},
118 | "outputs": [],
119 | "source": [
120 | "generate(model, tokenizer, inputs)"
121 | ]
122 | }
123 | ],
124 | "metadata": {
125 | "kernelspec": {
126 | "display_name": "mlx_code",
127 | "language": "python",
128 | "name": "python3"
129 | },
130 | "language_info": {
131 | "codemirror_mode": {
132 | "name": "ipython",
133 | "version": 3
134 | },
135 | "file_extension": ".py",
136 | "mimetype": "text/x-python",
137 | "name": "python",
138 | "nbconvert_exporter": "python",
139 | "pygments_lexer": "ipython3",
140 | "version": "3.10.14"
141 | }
142 | },
143 | "nbformat": 4,
144 | "nbformat_minor": 2
145 | }
146 |
--------------------------------------------------------------------------------
/Llama-3/Part 2/FineWeb10B.ipynb:
--------------------------------------------------------------------------------
1 | {
2 | "cells": [
3 | {
4 | "cell_type": "code",
5 | "execution_count": null,
6 | "metadata": {
7 | "colab": {
8 | "base_uri": "https://localhost:8080/"
9 | },
10 | "id": "KDeqo4iUPPK_",
11 | "outputId": "0726a097-ea22-481d-c5ce-7f330576053e"
12 | },
13 | "outputs": [],
14 | "source": [
15 | "!pip install datasets huggingface-hub[cli] hf-transfer pyarrow"
16 | ]
17 | },
18 | {
19 | "cell_type": "code",
20 | "execution_count": null,
21 | "metadata": {
22 | "colab": {
23 | "base_uri": "https://localhost:8080/",
24 | "height": 205,
25 | "referenced_widgets": [
26 | "3b6f554968d54e70a8e9ece4763fa612",
27 | "22b34f9b327746dd93ebe6538482ffdc",
28 | "0b69c40ac70542d7809ae6240c6ea8d3",
29 | "2921d2c9d1a7426bab54397a71772757",
30 | "a1e41ea2a9dd4138bcba2643d472ec78",
31 | "31ae5b1ddf8c405d9b97c901043365d7",
32 | "2e15dd36207146789d737ad4077d368b",
33 | "e4457ef8ca424b74a546e2fa85cdf69f",
34 | "53c8d989d57443ce8cd5861721bbdb41",
35 | "0eb083ee46f743d4b429bf6612da55a6",
36 | "048a01b6f7034148b5fa53d0e33088b0",
37 | "b9da0257d25e47d087dda0644be79d0f",
38 | "952a134784564a40a26f60fdc24a4b29",
39 | "c86d3e5c71804299bd3626eda5ad6bec",
40 | "3d78d66f951c48969c2036764cd415d0",
41 | "ed5c1bfee6084b40953064d2cf66ebdd",
42 | "fd32ad532d594347849bf1aa89fa0b15",
43 | "7c5d9781693245f8bf888e8f909f097f",
44 | "4289ac665bd3494b8091af46aa03d27b",
45 | "05eafd28789747c5b8541e81d392aef4",
46 | "bc96b42534554357872d4497616054c4",
47 | "6a59d212d5184dc39ce77c279ebca348"
48 | ]
49 | },
50 | "id": "jh97g8-gPe3H",
51 | "outputId": "e0c87cbd-939e-4faf-ab9d-87993f1a80ae"
52 | },
53 | "outputs": [],
54 | "source": [
55 | "from datasets import load_dataset\n",
56 | "# use name=\"sample-10BT\" to use the 10BT sample\n",
57 | "fw = load_dataset(\"HuggingFaceFW/fineweb\", name=\"CC-MAIN-2024-10\", split=\"train\", streaming=True)\n"
58 | ]
59 | },
60 | {
61 | "cell_type": "code",
62 | "execution_count": null,
63 | "metadata": {
64 | "colab": {
65 | "base_uri": "https://localhost:8080/"
66 | },
67 | "id": "yPWI91xuQGaV",
68 | "outputId": "00c4901f-dd90-411f-da71-8a831ca4241d"
69 | },
70 | "outputs": [],
71 | "source": [
72 | "fw"
73 | ]
74 | },
75 | {
76 | "cell_type": "code",
77 | "execution_count": null,
78 | "metadata": {
79 | "id": "oDnktugVQHru"
80 | },
81 | "outputs": [],
82 | "source": [
83 | "filtered_dataset = fw.filter(lambda example: example['language'] == 'en')"
84 | ]
85 | },
86 | {
87 | "cell_type": "code",
88 | "execution_count": null,
89 | "metadata": {
90 | "id": "Y9StbBlrQhgf"
91 | },
92 | "outputs": [],
93 | "source": [
94 | "from tqdm import tqdm\n",
95 | "\n",
96 | "# Wrapping the 'take' method call with tqdm to display a progress bar\n",
97 | "dataset_10B = [item for item in tqdm(filtered_dataset.take(15000000), total=15000000)]\n"
98 | ]
99 | },
100 | {
101 | "cell_type": "code",
102 | "execution_count": null,
103 | "metadata": {
104 | "id": "sSSY3K9kSwW5"
105 | },
106 | "outputs": [],
107 | "source": [
108 | "import pandas as pd\n",
109 | "dataset = pd.DataFrame(dataset_10B, index=None)"
110 | ]
111 | },
112 | {
113 | "cell_type": "code",
114 | "execution_count": null,
115 | "metadata": {
116 | "colab": {
117 | "base_uri": "https://localhost:8080/",
118 | "height": 1000
119 | },
120 | "id": "EXuxcMFBUYjQ",
121 | "outputId": "0881e898-89c2-4ebc-a3e4-7d479db33a77"
122 | },
123 | "outputs": [],
124 | "source": [
125 | "dataset"
126 | ]
127 | },
128 | {
129 | "cell_type": "code",
130 | "execution_count": null,
131 | "metadata": {
132 | "colab": {
133 | "base_uri": "https://localhost:8080/"
134 | },
135 | "id": "Ex1lKDNIVlTj",
136 | "outputId": "9789ff15-9a27-4bd8-b307-34039e3a2983"
137 | },
138 | "outputs": [],
139 | "source": [
140 | "dataset[\"token_count\"].sum()"
141 | ]
142 | },
143 | {
144 | "cell_type": "code",
145 | "execution_count": null,
146 | "metadata": {},
147 | "outputs": [],
148 | "source": [
149 | "dataset.to_csv('fineweb-CC-MAIN-2024-10-1B-en.csv', index=False)"
150 | ]
151 | },
152 | {
153 | "cell_type": "code",
154 | "execution_count": null,
155 | "metadata": {},
156 | "outputs": [],
157 | "source": [
158 | "from datasets import load_dataset\n",
159 | "\n",
160 | "\n",
161 | "dataset = load_dataset(\"csv\", data_files = {\"train\":\"fineweb-CC-MAIN-2024-10-8B-en.csv\"})"
162 | ]
163 | },
164 | {
165 | "cell_type": "code",
166 | "execution_count": null,
167 | "metadata": {},
168 | "outputs": [],
169 | "source": [
170 | "dataset.push_to_hub(\"fineweb-CC-MAIN-2024-10-8B-en\")"
171 | ]
172 | },
173 | {
174 | "cell_type": "code",
175 | "execution_count": null,
176 | "metadata": {},
177 | "outputs": [],
178 | "source": [
179 | "N = 8300000 # Specify the number of rows you want\n",
180 | "dataset_6b = dataset[\"train\"].select(range(N))"
181 | ]
182 | },
183 | {
184 | "cell_type": "code",
185 | "execution_count": null,
186 | "metadata": {},
187 | "outputs": [],
188 | "source": [
189 | "sum(dataset_6b['token_count'])"
190 | ]
191 | },
192 | {
193 | "cell_type": "code",
194 | "execution_count": null,
195 | "metadata": {},
196 | "outputs": [],
197 | "source": [
198 | "dataset_6b.push_to_hub(\"fineweb-CC-MAIN-2024-10-6B-en\")"
199 | ]
200 | },
201 | {
202 | "cell_type": "code",
203 | "execution_count": null,
204 | "metadata": {},
205 | "outputs": [],
206 | "source": [
207 | "from datasets import Dataset\n",
208 | "dataset_1b = Dataset.from_pandas(dataset, preserve_index=False)\n",
209 | "\n",
210 | "N = 1500000 # Specify the number of rows you want\n",
211 | "dataset_1b = dataset[\"train\"].select(range(N))"
212 | ]
213 | },
214 | {
215 | "cell_type": "code",
216 | "execution_count": null,
217 | "metadata": {},
218 | "outputs": [],
219 | "source": [
220 | "dataset_1b.push_to_hub(\"fineweb-CC-MAIN-2024-10-1B-en\")"
221 | ]
222 | }
223 | ],
224 | "metadata": {
225 | "colab": {
226 | "provenance": []
227 | },
228 | "kernelspec": {
229 | "display_name": "Python 3 (ipykernel)",
230 | "language": "python",
231 | "name": "python3"
232 | },
233 | "language_info": {
234 | "codemirror_mode": {
235 | "name": "ipython",
236 | "version": 3
237 | },
238 | "file_extension": ".py",
239 | "mimetype": "text/x-python",
240 | "name": "python",
241 | "nbconvert_exporter": "python",
242 | "pygments_lexer": "ipython3",
243 | "version": "3.10.14"
244 | },
245 | "widgets": {
246 | "application/vnd.jupyter.widget-state+json": {
247 | "048a01b6f7034148b5fa53d0e33088b0": {
248 | "model_module": "@jupyter-widgets/controls",
249 | "model_module_version": "1.5.0",
250 | "model_name": "DescriptionStyleModel",
251 | "state": {
252 | "_model_module": "@jupyter-widgets/controls",
253 | "_model_module_version": "1.5.0",
254 | "_model_name": "DescriptionStyleModel",
255 | "_view_count": null,
256 | "_view_module": "@jupyter-widgets/base",
257 | "_view_module_version": "1.2.0",
258 | "_view_name": "StyleView",
259 | "description_width": ""
260 | }
261 | },
262 | "05eafd28789747c5b8541e81d392aef4": {
263 | "model_module": "@jupyter-widgets/controls",
264 | "model_module_version": "1.5.0",
265 | "model_name": "ProgressStyleModel",
266 | "state": {
267 | "_model_module": "@jupyter-widgets/controls",
268 | "_model_module_version": "1.5.0",
269 | "_model_name": "ProgressStyleModel",
270 | "_view_count": null,
271 | "_view_module": "@jupyter-widgets/base",
272 | "_view_module_version": "1.2.0",
273 | "_view_name": "StyleView",
274 | "bar_color": null,
275 | "description_width": ""
276 | }
277 | },
278 | "0b69c40ac70542d7809ae6240c6ea8d3": {
279 | "model_module": "@jupyter-widgets/controls",
280 | "model_module_version": "1.5.0",
281 | "model_name": "FloatProgressModel",
282 | "state": {
283 | "_dom_classes": [],
284 | "_model_module": "@jupyter-widgets/controls",
285 | "_model_module_version": "1.5.0",
286 | "_model_name": "FloatProgressModel",
287 | "_view_count": null,
288 | "_view_module": "@jupyter-widgets/controls",
289 | "_view_module_version": "1.5.0",
290 | "_view_name": "ProgressView",
291 | "bar_style": "success",
292 | "description": "",
293 | "description_tooltip": null,
294 | "layout": "IPY_MODEL_e4457ef8ca424b74a546e2fa85cdf69f",
295 | "max": 23032,
296 | "min": 0,
297 | "orientation": "horizontal",
298 | "style": "IPY_MODEL_53c8d989d57443ce8cd5861721bbdb41",
299 | "value": 23032
300 | }
301 | },
302 | "0eb083ee46f743d4b429bf6612da55a6": {
303 | "model_module": "@jupyter-widgets/base",
304 | "model_module_version": "1.2.0",
305 | "model_name": "LayoutModel",
306 | "state": {
307 | "_model_module": "@jupyter-widgets/base",
308 | "_model_module_version": "1.2.0",
309 | "_model_name": "LayoutModel",
310 | "_view_count": null,
311 | "_view_module": "@jupyter-widgets/base",
312 | "_view_module_version": "1.2.0",
313 | "_view_name": "LayoutView",
314 | "align_content": null,
315 | "align_items": null,
316 | "align_self": null,
317 | "border": null,
318 | "bottom": null,
319 | "display": null,
320 | "flex": null,
321 | "flex_flow": null,
322 | "grid_area": null,
323 | "grid_auto_columns": null,
324 | "grid_auto_flow": null,
325 | "grid_auto_rows": null,
326 | "grid_column": null,
327 | "grid_gap": null,
328 | "grid_row": null,
329 | "grid_template_areas": null,
330 | "grid_template_columns": null,
331 | "grid_template_rows": null,
332 | "height": null,
333 | "justify_content": null,
334 | "justify_items": null,
335 | "left": null,
336 | "margin": null,
337 | "max_height": null,
338 | "max_width": null,
339 | "min_height": null,
340 | "min_width": null,
341 | "object_fit": null,
342 | "object_position": null,
343 | "order": null,
344 | "overflow": null,
345 | "overflow_x": null,
346 | "overflow_y": null,
347 | "padding": null,
348 | "right": null,
349 | "top": null,
350 | "visibility": null,
351 | "width": null
352 | }
353 | },
354 | "22b34f9b327746dd93ebe6538482ffdc": {
355 | "model_module": "@jupyter-widgets/controls",
356 | "model_module_version": "1.5.0",
357 | "model_name": "HTMLModel",
358 | "state": {
359 | "_dom_classes": [],
360 | "_model_module": "@jupyter-widgets/controls",
361 | "_model_module_version": "1.5.0",
362 | "_model_name": "HTMLModel",
363 | "_view_count": null,
364 | "_view_module": "@jupyter-widgets/controls",
365 | "_view_module_version": "1.5.0",
366 | "_view_name": "HTMLView",
367 | "description": "",
368 | "description_tooltip": null,
369 | "layout": "IPY_MODEL_31ae5b1ddf8c405d9b97c901043365d7",
370 | "placeholder": "",
371 | "style": "IPY_MODEL_2e15dd36207146789d737ad4077d368b",
372 | "value": "Resolving data files: 100%"
373 | }
374 | },
375 | "2921d2c9d1a7426bab54397a71772757": {
376 | "model_module": "@jupyter-widgets/controls",
377 | "model_module_version": "1.5.0",
378 | "model_name": "HTMLModel",
379 | "state": {
380 | "_dom_classes": [],
381 | "_model_module": "@jupyter-widgets/controls",
382 | "_model_module_version": "1.5.0",
383 | "_model_name": "HTMLModel",
384 | "_view_count": null,
385 | "_view_module": "@jupyter-widgets/controls",
386 | "_view_module_version": "1.5.0",
387 | "_view_name": "HTMLView",
388 | "description": "",
389 | "description_tooltip": null,
390 | "layout": "IPY_MODEL_0eb083ee46f743d4b429bf6612da55a6",
391 | "placeholder": "",
392 | "style": "IPY_MODEL_048a01b6f7034148b5fa53d0e33088b0",
393 | "value": " 23032/23032 [00:00<00:00, 7.06it/s]"
394 | }
395 | },
396 | "2e15dd36207146789d737ad4077d368b": {
397 | "model_module": "@jupyter-widgets/controls",
398 | "model_module_version": "1.5.0",
399 | "model_name": "DescriptionStyleModel",
400 | "state": {
401 | "_model_module": "@jupyter-widgets/controls",
402 | "_model_module_version": "1.5.0",
403 | "_model_name": "DescriptionStyleModel",
404 | "_view_count": null,
405 | "_view_module": "@jupyter-widgets/base",
406 | "_view_module_version": "1.2.0",
407 | "_view_name": "StyleView",
408 | "description_width": ""
409 | }
410 | },
411 | "31ae5b1ddf8c405d9b97c901043365d7": {
412 | "model_module": "@jupyter-widgets/base",
413 | "model_module_version": "1.2.0",
414 | "model_name": "LayoutModel",
415 | "state": {
416 | "_model_module": "@jupyter-widgets/base",
417 | "_model_module_version": "1.2.0",
418 | "_model_name": "LayoutModel",
419 | "_view_count": null,
420 | "_view_module": "@jupyter-widgets/base",
421 | "_view_module_version": "1.2.0",
422 | "_view_name": "LayoutView",
423 | "align_content": null,
424 | "align_items": null,
425 | "align_self": null,
426 | "border": null,
427 | "bottom": null,
428 | "display": null,
429 | "flex": null,
430 | "flex_flow": null,
431 | "grid_area": null,
432 | "grid_auto_columns": null,
433 | "grid_auto_flow": null,
434 | "grid_auto_rows": null,
435 | "grid_column": null,
436 | "grid_gap": null,
437 | "grid_row": null,
438 | "grid_template_areas": null,
439 | "grid_template_columns": null,
440 | "grid_template_rows": null,
441 | "height": null,
442 | "justify_content": null,
443 | "justify_items": null,
444 | "left": null,
445 | "margin": null,
446 | "max_height": null,
447 | "max_width": null,
448 | "min_height": null,
449 | "min_width": null,
450 | "object_fit": null,
451 | "object_position": null,
452 | "order": null,
453 | "overflow": null,
454 | "overflow_x": null,
455 | "overflow_y": null,
456 | "padding": null,
457 | "right": null,
458 | "top": null,
459 | "visibility": null,
460 | "width": null
461 | }
462 | },
463 | "3b6f554968d54e70a8e9ece4763fa612": {
464 | "model_module": "@jupyter-widgets/controls",
465 | "model_module_version": "1.5.0",
466 | "model_name": "HBoxModel",
467 | "state": {
468 | "_dom_classes": [],
469 | "_model_module": "@jupyter-widgets/controls",
470 | "_model_module_version": "1.5.0",
471 | "_model_name": "HBoxModel",
472 | "_view_count": null,
473 | "_view_module": "@jupyter-widgets/controls",
474 | "_view_module_version": "1.5.0",
475 | "_view_name": "HBoxView",
476 | "box_style": "",
477 | "children": [
478 | "IPY_MODEL_22b34f9b327746dd93ebe6538482ffdc",
479 | "IPY_MODEL_0b69c40ac70542d7809ae6240c6ea8d3",
480 | "IPY_MODEL_2921d2c9d1a7426bab54397a71772757"
481 | ],
482 | "layout": "IPY_MODEL_a1e41ea2a9dd4138bcba2643d472ec78"
483 | }
484 | },
485 | "3d78d66f951c48969c2036764cd415d0": {
486 | "model_module": "@jupyter-widgets/controls",
487 | "model_module_version": "1.5.0",
488 | "model_name": "HTMLModel",
489 | "state": {
490 | "_dom_classes": [],
491 | "_model_module": "@jupyter-widgets/controls",
492 | "_model_module_version": "1.5.0",
493 | "_model_name": "HTMLModel",
494 | "_view_count": null,
495 | "_view_module": "@jupyter-widgets/controls",
496 | "_view_module_version": "1.5.0",
497 | "_view_name": "HTMLView",
498 | "description": "",
499 | "description_tooltip": null,
500 | "layout": "IPY_MODEL_bc96b42534554357872d4497616054c4",
501 | "placeholder": "",
502 | "style": "IPY_MODEL_6a59d212d5184dc39ce77c279ebca348",
503 | "value": " 250/250 [00:00<00:00, 6483.86it/s]"
504 | }
505 | },
506 | "4289ac665bd3494b8091af46aa03d27b": {
507 | "model_module": "@jupyter-widgets/base",
508 | "model_module_version": "1.2.0",
509 | "model_name": "LayoutModel",
510 | "state": {
511 | "_model_module": "@jupyter-widgets/base",
512 | "_model_module_version": "1.2.0",
513 | "_model_name": "LayoutModel",
514 | "_view_count": null,
515 | "_view_module": "@jupyter-widgets/base",
516 | "_view_module_version": "1.2.0",
517 | "_view_name": "LayoutView",
518 | "align_content": null,
519 | "align_items": null,
520 | "align_self": null,
521 | "border": null,
522 | "bottom": null,
523 | "display": null,
524 | "flex": null,
525 | "flex_flow": null,
526 | "grid_area": null,
527 | "grid_auto_columns": null,
528 | "grid_auto_flow": null,
529 | "grid_auto_rows": null,
530 | "grid_column": null,
531 | "grid_gap": null,
532 | "grid_row": null,
533 | "grid_template_areas": null,
534 | "grid_template_columns": null,
535 | "grid_template_rows": null,
536 | "height": null,
537 | "justify_content": null,
538 | "justify_items": null,
539 | "left": null,
540 | "margin": null,
541 | "max_height": null,
542 | "max_width": null,
543 | "min_height": null,
544 | "min_width": null,
545 | "object_fit": null,
546 | "object_position": null,
547 | "order": null,
548 | "overflow": null,
549 | "overflow_x": null,
550 | "overflow_y": null,
551 | "padding": null,
552 | "right": null,
553 | "top": null,
554 | "visibility": null,
555 | "width": null
556 | }
557 | },
558 | "53c8d989d57443ce8cd5861721bbdb41": {
559 | "model_module": "@jupyter-widgets/controls",
560 | "model_module_version": "1.5.0",
561 | "model_name": "ProgressStyleModel",
562 | "state": {
563 | "_model_module": "@jupyter-widgets/controls",
564 | "_model_module_version": "1.5.0",
565 | "_model_name": "ProgressStyleModel",
566 | "_view_count": null,
567 | "_view_module": "@jupyter-widgets/base",
568 | "_view_module_version": "1.2.0",
569 | "_view_name": "StyleView",
570 | "bar_color": null,
571 | "description_width": ""
572 | }
573 | },
574 | "6a59d212d5184dc39ce77c279ebca348": {
575 | "model_module": "@jupyter-widgets/controls",
576 | "model_module_version": "1.5.0",
577 | "model_name": "DescriptionStyleModel",
578 | "state": {
579 | "_model_module": "@jupyter-widgets/controls",
580 | "_model_module_version": "1.5.0",
581 | "_model_name": "DescriptionStyleModel",
582 | "_view_count": null,
583 | "_view_module": "@jupyter-widgets/base",
584 | "_view_module_version": "1.2.0",
585 | "_view_name": "StyleView",
586 | "description_width": ""
587 | }
588 | },
589 | "7c5d9781693245f8bf888e8f909f097f": {
590 | "model_module": "@jupyter-widgets/controls",
591 | "model_module_version": "1.5.0",
592 | "model_name": "DescriptionStyleModel",
593 | "state": {
594 | "_model_module": "@jupyter-widgets/controls",
595 | "_model_module_version": "1.5.0",
596 | "_model_name": "DescriptionStyleModel",
597 | "_view_count": null,
598 | "_view_module": "@jupyter-widgets/base",
599 | "_view_module_version": "1.2.0",
600 | "_view_name": "StyleView",
601 | "description_width": ""
602 | }
603 | },
604 | "952a134784564a40a26f60fdc24a4b29": {
605 | "model_module": "@jupyter-widgets/controls",
606 | "model_module_version": "1.5.0",
607 | "model_name": "HTMLModel",
608 | "state": {
609 | "_dom_classes": [],
610 | "_model_module": "@jupyter-widgets/controls",
611 | "_model_module_version": "1.5.0",
612 | "_model_name": "HTMLModel",
613 | "_view_count": null,
614 | "_view_module": "@jupyter-widgets/controls",
615 | "_view_module_version": "1.5.0",
616 | "_view_name": "HTMLView",
617 | "description": "",
618 | "description_tooltip": null,
619 | "layout": "IPY_MODEL_fd32ad532d594347849bf1aa89fa0b15",
620 | "placeholder": "",
621 | "style": "IPY_MODEL_7c5d9781693245f8bf888e8f909f097f",
622 | "value": "Resolving data files: 100%"
623 | }
624 | },
625 | "a1e41ea2a9dd4138bcba2643d472ec78": {
626 | "model_module": "@jupyter-widgets/base",
627 | "model_module_version": "1.2.0",
628 | "model_name": "LayoutModel",
629 | "state": {
630 | "_model_module": "@jupyter-widgets/base",
631 | "_model_module_version": "1.2.0",
632 | "_model_name": "LayoutModel",
633 | "_view_count": null,
634 | "_view_module": "@jupyter-widgets/base",
635 | "_view_module_version": "1.2.0",
636 | "_view_name": "LayoutView",
637 | "align_content": null,
638 | "align_items": null,
639 | "align_self": null,
640 | "border": null,
641 | "bottom": null,
642 | "display": null,
643 | "flex": null,
644 | "flex_flow": null,
645 | "grid_area": null,
646 | "grid_auto_columns": null,
647 | "grid_auto_flow": null,
648 | "grid_auto_rows": null,
649 | "grid_column": null,
650 | "grid_gap": null,
651 | "grid_row": null,
652 | "grid_template_areas": null,
653 | "grid_template_columns": null,
654 | "grid_template_rows": null,
655 | "height": null,
656 | "justify_content": null,
657 | "justify_items": null,
658 | "left": null,
659 | "margin": null,
660 | "max_height": null,
661 | "max_width": null,
662 | "min_height": null,
663 | "min_width": null,
664 | "object_fit": null,
665 | "object_position": null,
666 | "order": null,
667 | "overflow": null,
668 | "overflow_x": null,
669 | "overflow_y": null,
670 | "padding": null,
671 | "right": null,
672 | "top": null,
673 | "visibility": null,
674 | "width": null
675 | }
676 | },
677 | "b9da0257d25e47d087dda0644be79d0f": {
678 | "model_module": "@jupyter-widgets/controls",
679 | "model_module_version": "1.5.0",
680 | "model_name": "HBoxModel",
681 | "state": {
682 | "_dom_classes": [],
683 | "_model_module": "@jupyter-widgets/controls",
684 | "_model_module_version": "1.5.0",
685 | "_model_name": "HBoxModel",
686 | "_view_count": null,
687 | "_view_module": "@jupyter-widgets/controls",
688 | "_view_module_version": "1.5.0",
689 | "_view_name": "HBoxView",
690 | "box_style": "",
691 | "children": [
692 | "IPY_MODEL_952a134784564a40a26f60fdc24a4b29",
693 | "IPY_MODEL_c86d3e5c71804299bd3626eda5ad6bec",
694 | "IPY_MODEL_3d78d66f951c48969c2036764cd415d0"
695 | ],
696 | "layout": "IPY_MODEL_ed5c1bfee6084b40953064d2cf66ebdd"
697 | }
698 | },
699 | "bc96b42534554357872d4497616054c4": {
700 | "model_module": "@jupyter-widgets/base",
701 | "model_module_version": "1.2.0",
702 | "model_name": "LayoutModel",
703 | "state": {
704 | "_model_module": "@jupyter-widgets/base",
705 | "_model_module_version": "1.2.0",
706 | "_model_name": "LayoutModel",
707 | "_view_count": null,
708 | "_view_module": "@jupyter-widgets/base",
709 | "_view_module_version": "1.2.0",
710 | "_view_name": "LayoutView",
711 | "align_content": null,
712 | "align_items": null,
713 | "align_self": null,
714 | "border": null,
715 | "bottom": null,
716 | "display": null,
717 | "flex": null,
718 | "flex_flow": null,
719 | "grid_area": null,
720 | "grid_auto_columns": null,
721 | "grid_auto_flow": null,
722 | "grid_auto_rows": null,
723 | "grid_column": null,
724 | "grid_gap": null,
725 | "grid_row": null,
726 | "grid_template_areas": null,
727 | "grid_template_columns": null,
728 | "grid_template_rows": null,
729 | "height": null,
730 | "justify_content": null,
731 | "justify_items": null,
732 | "left": null,
733 | "margin": null,
734 | "max_height": null,
735 | "max_width": null,
736 | "min_height": null,
737 | "min_width": null,
738 | "object_fit": null,
739 | "object_position": null,
740 | "order": null,
741 | "overflow": null,
742 | "overflow_x": null,
743 | "overflow_y": null,
744 | "padding": null,
745 | "right": null,
746 | "top": null,
747 | "visibility": null,
748 | "width": null
749 | }
750 | },
751 | "c86d3e5c71804299bd3626eda5ad6bec": {
752 | "model_module": "@jupyter-widgets/controls",
753 | "model_module_version": "1.5.0",
754 | "model_name": "FloatProgressModel",
755 | "state": {
756 | "_dom_classes": [],
757 | "_model_module": "@jupyter-widgets/controls",
758 | "_model_module_version": "1.5.0",
759 | "_model_name": "FloatProgressModel",
760 | "_view_count": null,
761 | "_view_module": "@jupyter-widgets/controls",
762 | "_view_module_version": "1.5.0",
763 | "_view_name": "ProgressView",
764 | "bar_style": "success",
765 | "description": "",
766 | "description_tooltip": null,
767 | "layout": "IPY_MODEL_4289ac665bd3494b8091af46aa03d27b",
768 | "max": 250,
769 | "min": 0,
770 | "orientation": "horizontal",
771 | "style": "IPY_MODEL_05eafd28789747c5b8541e81d392aef4",
772 | "value": 250
773 | }
774 | },
775 | "e4457ef8ca424b74a546e2fa85cdf69f": {
776 | "model_module": "@jupyter-widgets/base",
777 | "model_module_version": "1.2.0",
778 | "model_name": "LayoutModel",
779 | "state": {
780 | "_model_module": "@jupyter-widgets/base",
781 | "_model_module_version": "1.2.0",
782 | "_model_name": "LayoutModel",
783 | "_view_count": null,
784 | "_view_module": "@jupyter-widgets/base",
785 | "_view_module_version": "1.2.0",
786 | "_view_name": "LayoutView",
787 | "align_content": null,
788 | "align_items": null,
789 | "align_self": null,
790 | "border": null,
791 | "bottom": null,
792 | "display": null,
793 | "flex": null,
794 | "flex_flow": null,
795 | "grid_area": null,
796 | "grid_auto_columns": null,
797 | "grid_auto_flow": null,
798 | "grid_auto_rows": null,
799 | "grid_column": null,
800 | "grid_gap": null,
801 | "grid_row": null,
802 | "grid_template_areas": null,
803 | "grid_template_columns": null,
804 | "grid_template_rows": null,
805 | "height": null,
806 | "justify_content": null,
807 | "justify_items": null,
808 | "left": null,
809 | "margin": null,
810 | "max_height": null,
811 | "max_width": null,
812 | "min_height": null,
813 | "min_width": null,
814 | "object_fit": null,
815 | "object_position": null,
816 | "order": null,
817 | "overflow": null,
818 | "overflow_x": null,
819 | "overflow_y": null,
820 | "padding": null,
821 | "right": null,
822 | "top": null,
823 | "visibility": null,
824 | "width": null
825 | }
826 | },
827 | "ed5c1bfee6084b40953064d2cf66ebdd": {
828 | "model_module": "@jupyter-widgets/base",
829 | "model_module_version": "1.2.0",
830 | "model_name": "LayoutModel",
831 | "state": {
832 | "_model_module": "@jupyter-widgets/base",
833 | "_model_module_version": "1.2.0",
834 | "_model_name": "LayoutModel",
835 | "_view_count": null,
836 | "_view_module": "@jupyter-widgets/base",
837 | "_view_module_version": "1.2.0",
838 | "_view_name": "LayoutView",
839 | "align_content": null,
840 | "align_items": null,
841 | "align_self": null,
842 | "border": null,
843 | "bottom": null,
844 | "display": null,
845 | "flex": null,
846 | "flex_flow": null,
847 | "grid_area": null,
848 | "grid_auto_columns": null,
849 | "grid_auto_flow": null,
850 | "grid_auto_rows": null,
851 | "grid_column": null,
852 | "grid_gap": null,
853 | "grid_row": null,
854 | "grid_template_areas": null,
855 | "grid_template_columns": null,
856 | "grid_template_rows": null,
857 | "height": null,
858 | "justify_content": null,
859 | "justify_items": null,
860 | "left": null,
861 | "margin": null,
862 | "max_height": null,
863 | "max_width": null,
864 | "min_height": null,
865 | "min_width": null,
866 | "object_fit": null,
867 | "object_position": null,
868 | "order": null,
869 | "overflow": null,
870 | "overflow_x": null,
871 | "overflow_y": null,
872 | "padding": null,
873 | "right": null,
874 | "top": null,
875 | "visibility": null,
876 | "width": null
877 | }
878 | },
879 | "fd32ad532d594347849bf1aa89fa0b15": {
880 | "model_module": "@jupyter-widgets/base",
881 | "model_module_version": "1.2.0",
882 | "model_name": "LayoutModel",
883 | "state": {
884 | "_model_module": "@jupyter-widgets/base",
885 | "_model_module_version": "1.2.0",
886 | "_model_name": "LayoutModel",
887 | "_view_count": null,
888 | "_view_module": "@jupyter-widgets/base",
889 | "_view_module_version": "1.2.0",
890 | "_view_name": "LayoutView",
891 | "align_content": null,
892 | "align_items": null,
893 | "align_self": null,
894 | "border": null,
895 | "bottom": null,
896 | "display": null,
897 | "flex": null,
898 | "flex_flow": null,
899 | "grid_area": null,
900 | "grid_auto_columns": null,
901 | "grid_auto_flow": null,
902 | "grid_auto_rows": null,
903 | "grid_column": null,
904 | "grid_gap": null,
905 | "grid_row": null,
906 | "grid_template_areas": null,
907 | "grid_template_columns": null,
908 | "grid_template_rows": null,
909 | "height": null,
910 | "justify_content": null,
911 | "justify_items": null,
912 | "left": null,
913 | "margin": null,
914 | "max_height": null,
915 | "max_width": null,
916 | "min_height": null,
917 | "min_width": null,
918 | "object_fit": null,
919 | "object_position": null,
920 | "order": null,
921 | "overflow": null,
922 | "overflow_x": null,
923 | "overflow_y": null,
924 | "padding": null,
925 | "right": null,
926 | "top": null,
927 | "visibility": null,
928 | "width": null
929 | }
930 | }
931 | }
932 | }
933 | },
934 | "nbformat": 4,
935 | "nbformat_minor": 4
936 | }
937 |
--------------------------------------------------------------------------------
/Llama-3/Part 2/assets/Comparision_of_Model_Scores.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Blaizzy/Coding-LLMs-from-scratch/9c3a182fcfbe898f532c9de947a5fc692ca4cc6b/Llama-3/Part 2/assets/Comparision_of_Model_Scores.png
--------------------------------------------------------------------------------
/Llama-3/Part 2/assets/Experiment Canvas.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Blaizzy/Coding-LLMs-from-scratch/9c3a182fcfbe898f532c9de947a5fc692ca4cc6b/Llama-3/Part 2/assets/Experiment Canvas.png
--------------------------------------------------------------------------------
/Llama-3/Part 2/assets/Llama-3-8B-vs-6B-v0.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Blaizzy/Coding-LLMs-from-scratch/9c3a182fcfbe898f532c9de947a5fc692ca4cc6b/Llama-3/Part 2/assets/Llama-3-8B-vs-6B-v0.png
--------------------------------------------------------------------------------
/Llama-3/Part 2/assets/Training Loss.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Blaizzy/Coding-LLMs-from-scratch/9c3a182fcfbe898f532c9de947a5fc692ca4cc6b/Llama-3/Part 2/assets/Training Loss.png
--------------------------------------------------------------------------------
/Llama-3/Part 2/assets/downcycling.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Blaizzy/Coding-LLMs-from-scratch/9c3a182fcfbe898f532c9de947a5fc692ca4cc6b/Llama-3/Part 2/assets/downcycling.png
--------------------------------------------------------------------------------
/Llama-3/Part 2/assets/llama-3-6B icon.jpeg:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Blaizzy/Coding-LLMs-from-scratch/9c3a182fcfbe898f532c9de947a5fc692ca4cc6b/Llama-3/Part 2/assets/llama-3-6B icon.jpeg
--------------------------------------------------------------------------------
/Llama-3/Part 2/assets/model_scores.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Blaizzy/Coding-LLMs-from-scratch/9c3a182fcfbe898f532c9de947a5fc692ca4cc6b/Llama-3/Part 2/assets/model_scores.png
--------------------------------------------------------------------------------
/Llama-3/Part 2/assets/model_scores_llama_3_8B.png:
--------------------------------------------------------------------------------
https://raw.githubusercontent.com/Blaizzy/Coding-LLMs-from-scratch/9c3a182fcfbe898f532c9de947a5fc692ca4cc6b/Llama-3/Part 2/assets/model_scores_llama_3_8B.png
--------------------------------------------------------------------------------
/README.md:
--------------------------------------------------------------------------------
1 | # Coding LLMs from scratch
2 | # Coding Llama-2
3 | You will learn how to train and fine-tune the Llama 2 model from scratch.
4 |
5 | Throughout the series you will learn about the Transformer architecture, different attention mechanisms (MHA, MQA and GQA), the KV cache, RoPE, and the Hugging Face Trainer in detail.
6 |
7 | By the end, you will have created and trained a LLaMA 2 model with 100M parameters from scratch in PyTorch for code completion.
8 |
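If you haven't met grouped-query attention before, here is a minimal sketch of the idea (illustrative shapes and names, not the notebooks' exact code): with `n_kv_heads < n_heads`, groups of query heads share one key/value head, with MHA (`n_kv_heads == n_heads`) and MQA (`n_kv_heads == 1`) as the two extremes.

```python
import torch
from torch import nn

class GroupedQueryAttention(nn.Module):
    def __init__(self, d_model=768, n_heads=12, n_kv_heads=4):
        super().__init__()
        assert n_heads % n_kv_heads == 0
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = d_model // n_heads
        self.q_proj = nn.Linear(d_model, n_heads * self.head_dim, bias=False)
        self.k_proj = nn.Linear(d_model, n_kv_heads * self.head_dim, bias=False)
        self.v_proj = nn.Linear(d_model, n_kv_heads * self.head_dim, bias=False)
        self.o_proj = nn.Linear(n_heads * self.head_dim, d_model, bias=False)

    def forward(self, x):  # causal mask omitted for brevity
        bs, seq_len, _ = x.shape
        q = self.q_proj(x).view(bs, seq_len, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x).view(bs, seq_len, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x).view(bs, seq_len, self.n_kv_heads, self.head_dim).transpose(1, 2)
        # Repeat each KV head so every group of query heads has a matching KV head
        groups = self.n_heads // self.n_kv_heads
        k = k.repeat_interleave(groups, dim=1)
        v = v.repeat_interleave(groups, dim=1)
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.head_dim**0.5, dim=-1)
        return self.o_proj((attn @ v).transpose(1, 2).reshape(bs, seq_len, -1))
```
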
9 | 🎥 **YT Video Playlist:**
10 | - https://youtube.com/playlist?list=PLDn_JsyofyfTH5_5V1MNb8UYKxMl6IMNy&si=5Y4cm-6wrMOD1Abr
11 |
12 |
13 |
14 | # Coding Llama-3
15 |
16 | You will learn how to train and fine-tune the Llama 3 model from scratch.
17 |
18 | The goal is to code LLaMA 3 from scratch in PyTorch to create models with 3B, 6B, 35B and 45B parameters.
19 |
20 | 🎥 **YT Video Playlist:**
21 | - https://youtube.com/playlist?list=PLDn_JsyofyfTH5_5V1MNb8UYKxMl6IMNy&si=5Y4cm-6wrMOD1Abr
22 |
23 | 📚 **Papers**:
24 | - Sparse Upcycling: Training Mixture-of-Experts from Dense Checkpoints: https://arxiv.org/abs/2212.05055
25 | - Pre-training Small Base LMs with Fewer Tokens: https://arxiv.org/abs/2404.08634
26 | - Leave No Context Behind: Efficient Infinite Context Transformers with Infini-attention: https://arxiv.org/abs/2404.07143
28 |
29 |
30 |
31 | ## Llama-3-6B-v0.1
32 |
33 |
34 | Introducing the world's first Llama-3 base model with 6B parameters. This model is a pretrained version of [prince-canuma/Llama-3-6B-v0](https://huggingface.co/prince-canuma/Llama-3-6B-v0), which was created from Meta-Llama-3-8B using a technique called [downcycling](https://youtube.com/playlist?list=PLDn_JsyofyfTH5_5V1MNb8UYKxMl6IMNy&si=9hcOol4KHIgWThgt).
35 | The model was continually pretrained on 1 billion tokens of English-only text from FineWeb, reaching the following result on the evaluation set:
36 | - Loss: 2.4942
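
For intuition, downcycling initialises a smaller model from a subset of a larger checkpoint's layers. Below is a rough sketch of the layer-selection variant, assuming we keep the first 24 of Meta-Llama-3-8B's 32 layers (the exact selection used for v0 may differ):

```python
from transformers import AutoConfig, AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")

config = AutoConfig.from_pretrained("meta-llama/Meta-Llama-3-8B")
config.num_hidden_layers = 24  # dropping 8 of 32 layers takes ~8B down to ~6B params

small = AutoModelForCausalLM.from_config(config)  # randomly initialised
# Copy matching weights (embeddings, head, layers 0-23); weights of the
# dropped layers have no destination and are ignored via strict=False.
small.load_state_dict(base.state_dict(), strict=False)
small.save_pretrained("Llama-3-6B-v0")
```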
37 |
38 |
39 | ## Model Description
40 |
41 | - **Developed by:** [Prince Canuma](https://huggingface.co/prince-canuma)
42 | - **Sponsored by:** General
43 | - **Model type:** Llama
44 | - **License:** [Llama-3](https://llama.meta.com/llama3/license)
45 | - **Pretrained from model:** prince-canuma/Llama-3-6B-v0
46 |
47 | ### Model Sources
48 |
49 | - **Repository:** https://github.com/Blaizzy/Coding-LLMs-from-scratch/tree/main/Llama-3
50 | - **Video:** https://youtube.com/playlist?list=PLDn_JsyofyfTH5_5V1MNb8UYKxMl6IMNy&si=5Y4cm-6wrMOD1Abr
51 |
52 | ## Uses
53 |
54 |
55 | You can use this model to create instruct and chat versions for various use cases, such as a coding assistant, RAG, function calling and more.
56 |
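For example, a minimal completion call (mirroring the Part 2 notebook):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "prince-canuma/Llama-3-6B-v0.1"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

inputs = tokenizer(["The Eiffel Tower is located in"], return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
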
57 | ### Limitations
58 |
59 | This model inherits some of the base model's limitations, plus some additional ones from its creation process, such as:
60 | - Limited scope for coding and math: According to benchmarks, this model needs more pretraining/fine-tuning on code and math data to excel at reasoning tasks.
61 | - Language limitations: This model was continually pretrained on English-only data. If you plan to use it for multilingual use cases, I recommend fine-tuning or continued pretraining (a minimal sketch follows below).
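
A minimal continued-pretraining sketch with the Hugging Face Trainer; the dataset name and hyperparameters below are placeholders, not the settings used for v0.1:

```python
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "prince-canuma/Llama-3-6B-v0.1"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama tokenizers ship without a pad token

ds = load_dataset("your-org/your-multilingual-corpus", split="train")  # placeholder dataset
ds = ds.map(lambda ex: tokenizer(ex["text"], truncation=True, max_length=2048),
            remove_columns=ds.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="llama-3-6B-continued",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        learning_rate=2e-5,
        bf16=True,
    ),
    train_dataset=ds,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # causal LM loss
)
trainer.train()
```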
62 |
63 |
64 | ## Read more
65 | https://huggingface.co/prince-canuma/Llama-3-6B-v0.1
--------------------------------------------------------------------------------