├── DhammaBerg_GPT2.ipynb └── README.md /DhammaBerg_GPT2.ipynb: -------------------------------------------------------------------------------- 1 | { 2 | "nbformat": 4, 3 | "nbformat_minor": 0, 4 | "metadata": { 5 | "colab": { 6 | "name": "DhammaBerg.AI", 7 | "version": "0.3.2", 8 | "provenance": [], 9 | "private_outputs": true, 10 | "collapsed_sections": [ 11 | "RMi46NNXcJkI", 12 | "PXgEKCOjdWjf", 13 | "yPfJ5b3CQXqr", 14 | "OxXYe-ZN9Oj4" 15 | ], 16 | "toc_visible": true 17 | }, 18 | "kernelspec": { 19 | "name": "python3", 20 | "display_name": "Python 3" 21 | }, 22 | "accelerator": "GPU" 23 | }, 24 | "cells": [ 25 | { 26 | "metadata": { 27 | "id": "2hxLV4YTmcL7", 28 | "colab_type": "text" 29 | }, 30 | "cell_type": "markdown", 31 | "source": [ 32 | "# GPT-2 117M Fine-Tuning\n", 33 | "\n", 34 | "Adaptation of https://github.com/ak9250/gpt-2-colab\n", 35 | "\n", 36 | "Includes code for building a single huge English training text file based on Project Gutenberg (books) or Access to Insight (Dharma texts). ATI seems to overfit quite rapidly since it's only 6 MB, but a 99 MB subset of Gutenberg came out quite convincing after only 4 hours of training. Your mileage may vary." 37 | ] 38 | }, 39 | { 40 | "metadata": { 41 | "id": "Pzxl1vYX-1kk", 42 | "colab_type": "text" 43 | }, 44 | "cell_type": "markdown", 45 | "source": [ 46 | "## Setup\n", 47 | "\n", 48 | "1) Make sure the GPU is enabled: go to Edit -> Notebook settings -> Hardware accelerator -> GPU\n", 49 | "\n", 50 | "2) Make a copy to your Google Drive: click Copy to Drive in the panel" 51 | ] 52 | }, 53 | { 54 | "metadata": { 55 | "id": "iW0abT07ZkhZ", 56 | "colab_type": "text" 57 | }, 58 | "cell_type": "markdown", 59 | "source": [ 60 | "Note: Colab resets the runtime after 12 hours, so save your model checkpoints to Google Drive around the 10-11 hour mark or earlier, then go to Runtime -> Reset all runtimes. Copy your trained model back into Colab and resume training from the previous checkpoint."
61 | ] 62 | }, 63 | { 64 | "metadata": { 65 | "id": "Z4DHSFr6cRF8", 66 | "colab_type": "text" 67 | }, 68 | "cell_type": "markdown", 69 | "source": [ 70 | "### Initialize" 71 | ] 72 | }, 73 | { 74 | "metadata": { 75 | "id": "iLXW02eIYpcB", 76 | "colab_type": "text" 77 | }, 78 | "cell_type": "markdown", 79 | "source": [ 80 | "clone and cd into repo" 81 | ] 82 | }, 83 | { 84 | "metadata": { 85 | "id": "ICYu3w9hIJkC", 86 | "colab_type": "code", 87 | "colab": {} 88 | }, 89 | "cell_type": "code", 90 | "source": [ 91 | "!git clone https://github.com/nshepperd/gpt-2.git" 92 | ], 93 | "execution_count": 0, 94 | "outputs": [] 95 | }, 96 | { 97 | "metadata": { 98 | "id": "Qtn1qZPgZLb0", 99 | "colab_type": "text" 100 | }, 101 | "cell_type": "markdown", 102 | "source": [ 103 | "install requirements" 104 | ] 105 | }, 106 | { 107 | "metadata": { 108 | "id": "434oOx0bZH6J", 109 | "colab_type": "code", 110 | "colab": {} 111 | }, 112 | "cell_type": "code", 113 | "source": [ 114 | "!pip3 install --upgrade tensorflow-gpu beautifulsoup4\n", 115 | "!pip3 install -r requirements.txt" 116 | ], 117 | "execution_count": 0, 118 | "outputs": [] 119 | }, 120 | { 121 | "metadata": { 122 | "id": "6eEIs3ApZUVO", 123 | "colab_type": "code", 124 | "colab": {} 125 | }, 126 | "cell_type": "code", 127 | "source": [ 128 | "cd gpt-2" 129 | ], 130 | "execution_count": 0, 131 | "outputs": [] 132 | }, 133 | { 134 | "metadata": { 135 | "id": "o1hrgeKFYsuE", 136 | "colab_type": "text" 137 | }, 138 | "cell_type": "markdown", 139 | "source": [ 140 | "download the model" 141 | ] 142 | }, 143 | { 144 | "metadata": { 145 | "colab_type": "code", 146 | "id": "A498TySgHYyF", 147 | "colab": {} 148 | }, 149 | "cell_type": "code", 150 | "source": [ 151 | "!python3 download_model.py 117M" 152 | ], 153 | "execution_count": 0, 154 | "outputs": [] 155 | }, 156 | { 157 | "metadata": { 158 | "id": "eyeiSvqmfZNV", 159 | "colab_type": "text" 160 | }, 161 | "cell_type": "markdown", 162 | "source": [ 163 | "set encoding" 164 | ] 
165 | }, 166 | { 167 | "metadata": { 168 | "id": "7oJPQtdLbbeK", 169 | "colab_type": "code", 170 | "colab": {} 171 | }, 172 | "cell_type": "code", 173 | "source": [ 174 | "%env PYTHONIOENCODING=UTF-8" 175 | ], 176 | "execution_count": 0, 177 | "outputs": [] 178 | }, 179 | { 180 | "metadata": { 181 | "id": "RMi46NNXcJkI", 182 | "colab_type": "text" 183 | }, 184 | "cell_type": "markdown", 185 | "source": [ 186 | "### Mount Google Drive" 187 | ] 188 | }, 189 | { 190 | "metadata": { 191 | "id": "WvUQhgK3PQ4L", 192 | "colab_type": "text" 193 | }, 194 | "cell_type": "markdown", 195 | "source": [ 196 | "mount Google Drive so checkpoints can be saved and restored later" 197 | ] 198 | }, 199 | { 200 | "metadata": { 201 | "id": "FNpf6R4ahYSN", 202 | "colab_type": "code", 203 | "colab": {} 204 | }, 205 | "cell_type": "code", 206 | "source": [ 207 | "from google.colab import drive\n", 208 | "drive.mount('/content/drive')" 209 | ], 210 | "execution_count": 0, 211 | "outputs": [] 212 | }, 213 | { 214 | "metadata": { 215 | "id": "0KzSbAvePgsI", 216 | "colab_type": "text" 217 | }, 218 | "cell_type": "markdown", 219 | "source": [ 220 | "(optional) fetch checkpoints if you have them saved in Google Drive" 221 | ] 222 | }, 223 | { 224 | "metadata": { 225 | "id": "cA2Wk7yIPmS6", 226 | "colab_type": "code", 227 | "colab": {} 228 | }, 229 | "cell_type": "code", 230 | "source": [ 231 | "!cp -r /content/drive/My\ Drive/checkpoint/ /content/gpt-2/ " 232 | ], 233 | "execution_count": 0, 234 | "outputs": [] 235 | }, 236 | { 237 | "metadata": { 238 | "id": "NSc7_rYkcZbk", 239 | "colab_type": "text" 240 | }, 241 | "cell_type": "markdown", 242 | "source": [ 243 | "## Download + Prepare Training Data" 244 | ] 245 | }, 246 | { 247 | "metadata": { 248 | "id": "8NGvwu_Ucefp", 249 | "colab_type": "text" 250 | }, 251 | "cell_type": "markdown", 252 | "source": [ 253 | "" 254 | ] 255 | }, 256 | { 257 | "metadata": { 258 | "id": "k-o_fWOVFrEA", 259 | "colab_type": "code", 260 | "colab": 
{} 261 | }, 262 | "cell_type": "code", 263 | "source": [ 264 | "cd /content/gpt-2" 265 | ], 266 | "execution_count": 0, 267 | "outputs": [] 268 | }, 269 | { 270 | "metadata": { 271 | "id": "AKAOgbcmczdE", 272 | "colab_type": "code", 273 | "colab": {} 274 | }, 275 | "cell_type": "code", 276 | "source": [ 277 | "mkdir data" 278 | ], 279 | "execution_count": 0, 280 | "outputs": [] 281 | }, 282 | { 283 | "metadata": { 284 | "id": "Oo0Q5GWLc1Iu", 285 | "colab_type": "code", 286 | "colab": {} 287 | }, 288 | "cell_type": "code", 289 | "source": [ 290 | "cd data" 291 | ], 292 | "execution_count": 0, 293 | "outputs": [] 294 | }, 295 | { 296 | "metadata": { 297 | "id": "0p--9zwqQRTc", 298 | "colab_type": "text" 299 | }, 300 | "cell_type": "markdown", 301 | "source": [ 302 | "### Project Gutenberg\n", 303 | "\n", 304 | "Download through their [Robot Access](http://www.gutenberg.org/wiki/Gutenberg:Information_About_Robot_Access_to_our_Pages)" 305 | ] 306 | }, 307 | { 308 | "metadata": { 309 | "id": "IpArPqxBPY3y", 310 | "colab_type": "code", 311 | "colab": {} 312 | }, 313 | "cell_type": "code", 314 | "source": [ 315 | "mkdir books" 316 | ], 317 | "execution_count": 0, 318 | "outputs": [] 319 | }, 320 | { 321 | "metadata": { 322 | "id": "Oi69zY2Jiu4b", 323 | "colab_type": "text" 324 | }, 325 | "cell_type": "markdown", 326 | "source": [ 327 | "download and unzip (this will take a while)" 328 | ] 329 | }, 330 | { 331 | "metadata": { 332 | "id": "QOCvrs-DHvxa", 333 | "colab_type": "code", 334 | "colab": {} 335 | }, 336 | "cell_type": "code", 337 | "source": [ 338 | "!wget -w 0.5 -m -H \"http://www.gutenberg.org/robot/harvest?filetypes[]=txt&langs[]=en\"" 339 | ], 340 | "execution_count": 0, 341 | "outputs": [] 342 | }, 343 | { 344 | "metadata": { 345 | "id": "K7WfbFtqmJ3n", 346 | "colab_type": "code", 347 | "colab": {} 348 | }, 349 | "cell_type": "code", 350 | "source": [ 351 | "!find . 
-name \"*[!-8].zip\" | while read filename; do unzip -o -d \"`basename -s .zip \"$filename\"`\" \"$filename\"; done;" 352 | ], 353 | "execution_count": 0, 354 | "outputs": [] 355 | }, 356 | { 357 | "metadata": { 358 | "id": "rWTB_loVdJgs", 359 | "colab_type": "text" 360 | }, 361 | "cell_type": "markdown", 362 | "source": [ 363 | "collect text files for conversion" 364 | ] 365 | }, 366 | { 367 | "metadata": { 368 | "id": "ohiK40aamuNJ", 369 | "colab_type": "code", 370 | "colab": {} 371 | }, 372 | "cell_type": "code", 373 | "source": [ 374 | "!find . -name \"*.txt\" | while read filename; do cp $filename /content/gpt-2/books/; done;" 375 | ], 376 | "execution_count": 0, 377 | "outputs": [] 378 | }, 379 | { 380 | "metadata": { 381 | "id": "FW9yT-3OPmNL", 382 | "colab_type": "text" 383 | }, 384 | "cell_type": "markdown", 385 | "source": [ 386 | "change charset to utf8" 387 | ] 388 | }, 389 | { 390 | "metadata": { 391 | "id": "ENxoKbCYkW8T", 392 | "colab_type": "code", 393 | "colab": {} 394 | }, 395 | "cell_type": "code", 396 | "source": [ 397 | "!find . 
-name \"*.txt\" | while read filename; do iconv -f ascii -t utf8 $filename > $filename-utf8.txt ; done;" 398 | ], 399 | "execution_count": 0, 400 | "outputs": [] 401 | }, 402 | { 403 | "metadata": { 404 | "id": "59bfzKEJPplb", 405 | "colab_type": "text" 406 | }, 407 | "cell_type": "markdown", 408 | "source": [ 409 | "combine into single text file" 410 | ] 411 | }, 412 | { 413 | "metadata": { 414 | "id": "-iAs2XT61sU9", 415 | "colab_type": "code", 416 | "colab": {} 417 | }, 418 | "cell_type": "code", 419 | "source": [ 420 | "cat *utf8.txt >> allbooks-utf8.txt" 421 | ], 422 | "execution_count": 0, 423 | "outputs": [] 424 | }, 425 | { 426 | "metadata": { 427 | "id": "PXgEKCOjdWjf", 428 | "colab_type": "text" 429 | }, 430 | "cell_type": "markdown", 431 | "source": [ 432 | "### Access to Insight\n", 433 | "\n", 434 | "Bulk download from [this page](https://accesstoinsight.org/tech/download/bulk.html)" 435 | ] 436 | }, 437 | { 438 | "metadata": { 439 | "id": "_kZEr0r2djTr", 440 | "colab_type": "code", 441 | "colab": {} 442 | }, 443 | "cell_type": "code", 444 | "source": [ 445 | "!wget \"http://accesstoinsight.org/tech/download/ati.zip\"" 446 | ], 447 | "execution_count": 0, 448 | "outputs": [] 449 | }, 450 | { 451 | "metadata": { 452 | "id": "a5FH7EAEeBVG", 453 | "colab_type": "text" 454 | }, 455 | "cell_type": "markdown", 456 | "source": [ 457 | "unzip the archive" 458 | ] 459 | }, 460 | { 461 | "metadata": { 462 | "id": "Az4sTE34eDCO", 463 | "colab_type": "code", 464 | "colab": {} 465 | }, 466 | "cell_type": "code", 467 | "source": [ 468 | "!unzip ati.zip" 469 | ], 470 | "execution_count": 0, 471 | "outputs": [] 472 | }, 473 | { 474 | "metadata": { 475 | "id": "MiJPoejOeIWG", 476 | "colab_type": "code", 477 | "colab": {} 478 | }, 479 | "cell_type": "code", 480 | "source": [ 481 | "cd /content/gpt-2/data/ati" 482 | ], 483 | "execution_count": 0, 484 | "outputs": [] 485 | }, 486 | { 487 | "metadata": { 488 | "id": "69NoNXPB29Id", 489 | "colab_type": "text" 490 | }, 491 | 
"cell_type": "markdown", 492 | "source": [ 493 | "#### Convert HTML to Text" 494 | ] 495 | }, 496 | { 497 | "metadata": { 498 | "id": "xJcDmfJ2bdmz", 499 | "colab_type": "text" 500 | }, 501 | "cell_type": "markdown", 502 | "source": [ 503 | "Parser for Access to Insight HTML dump based on [this script](https://codereview.stackexchange.com/questions/128515/parsing-locally-stored-html-files) and [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/index.html)" 504 | ] 505 | }, 506 | { 507 | "metadata": { 508 | "id": "VmZgVcsV15Id", 509 | "colab_type": "code", 510 | "colab": {} 511 | }, 512 | "cell_type": "code", 513 | "source": [ 514 | "from bs4 import BeautifulSoup\n", 515 | "import glob\n", 516 | "import os\n", 517 | "import re\n", 518 | "import contextlib\n", 519 | "\n", 520 | "\n", 521 | "@contextlib.contextmanager\n", 522 | "def stdout2file(fname):\n", 523 | " import sys\n", 524 | " f = open(fname, 'w')\n", 525 | " sys.stdout = f\n", 526 | " yield\n", 527 | " sys.stdout = sys.__stdout__\n", 528 | " f.close()\n", 529 | "\n", 530 | "\n", 531 | "def parser():\n", 532 | " # os.chdir(\"/content/gpt-2/data/ati/\")\n", 533 | " with stdout2file(\"dhamma_cleaned.txt\"):\n", 534 | " for file in glob.iglob('tipitaka/**/*.html', recursive=True):\n", 535 | " with open(file, encoding=\"utf8\") as f:\n", 536 | " contents = f.read()\n", 537 | " soup = BeautifulSoup(contents, \"html.parser\")\n", 538 | " for item in soup.find_all([\"blockquote\",\"h4\",\"p\"]):\n", 539 | " print(item.get_text())\n", 540 | " print('\\n')\n", 541 | " # break\n", 542 | "parser()" 543 | ], 544 | "execution_count": 0, 545 | "outputs": [] 546 | }, 547 | { 548 | "metadata": { 549 | "id": "yPfJ5b3CQXqr", 550 | "colab_type": "text" 551 | }, 552 | "cell_type": "markdown", 553 | "source": [ 554 | "## Train the Model" 555 | ] 556 | }, 557 | { 558 | "metadata": { 559 | "id": "yft7g5w2bOX4", 560 | "colab_type": "text" 561 | }, 562 | "cell_type": "markdown", 563 | "source": [ 564 | "enter the 
directory" 565 | ] 566 | }, 567 | { 568 | "metadata": { 569 | "id": "DCoJhKk-2Bx9", 570 | "colab_type": "code", 571 | "colab": {} 572 | }, 573 | "cell_type": "code", 574 | "source": [ 575 | "cd /content/gpt-2" 576 | ], 577 | "execution_count": 0, 578 | "outputs": [] 579 | }, 580 | { 581 | "metadata": { 582 | "id": "CBiLbT80bMoM", 583 | "colab_type": "text" 584 | }, 585 | "cell_type": "markdown", 586 | "source": [ 587 | "initiate training (set --dataset to the data file created in the previous step)" 588 | ] 589 | }, 590 | { 591 | "metadata": { 592 | "id": "pEn_ihcGI00T", 593 | "colab_type": "code", 594 | "colab": {} 595 | }, 596 | "cell_type": "code", 597 | "source": [ 598 | "!PYTHONPATH=src ./train.py --dataset /content/gpt-2/data/dhamma_cleaned.txt" 599 | ], 600 | "execution_count": 0, 601 | "outputs": [] 602 | }, 603 | { 604 | "metadata": { 605 | "id": "vS1RJJDFOPnb", 606 | "colab_type": "text" 607 | }, 608 | "cell_type": "markdown", 609 | "source": [ 610 | "save our checkpoints to start training again later" 611 | ] 612 | }, 613 | { 614 | "metadata": { 615 | "id": "JretqG1zOXdi", 616 | "colab_type": "code", 617 | "colab": {} 618 | }, 619 | "cell_type": "code", 620 | "source": [ 621 | "!cp -r /content/gpt-2/checkpoint/ /content/drive/My\ Drive/" 622 | ], 623 | "execution_count": 0, 624 | "outputs": [] 625 | }, 626 | { 627 | "metadata": { 628 | "id": "6D-i7vERWbNS", 629 | "colab_type": "text" 630 | }, 631 | "cell_type": "markdown", 632 | "source": [ 633 | "copy re-trained (fine-tuned) model into the main directory" 634 | ] 635 | }, 636 | { 637 | "metadata": { 638 | "id": "VeETvWvrbKga", 639 | "colab_type": "code", 640 | "colab": {} 641 | }, 642 | "cell_type": "code", 643 | "source": [ 644 | "!cp -r /content/gpt-2/checkpoint/run1/* /content/gpt-2/models/117M/" 645 | ], 646 | "execution_count": 0, 647 | "outputs": [] 648 | }, 649 | { 650 | "metadata": { 651 | "id": "OxXYe-ZN9Oj4", 652 | "colab_type": "text" 653 | }, 654 | "cell_type": "markdown", 655 | "source": [ 656 | "## Use the Trained 
Model\n", 657 | "\n", 658 | "There are a few flags available, each with a default value:\n", 659 | "\n", 660 | "* `seed = None` || a random seed is used unless specified; pass a specific integer if you want to reproduce the same results in the future.\n", 661 | "* `nsamples = 1` || the number of samples to generate and print.\n", 662 | "* `length = None` || number of tokens to generate per sample (tokens are subword pieces, not whole words).\n", 663 | "* `batch_size = 1` || how many samples to generate simultaneously; doesn't seem to affect the results, only speed and memory use.\n", 664 | "* `temperature = 1` || divides the logits before the softmax; values below 1 make sampling more conservative, values above 1 more random.\n", 665 | "* `top_k = 0` || restricts sampling to the k most likely tokens at each step; 0 means no restriction (40 is a common choice)." 666 | ] 667 | }, 668 | { 669 | "metadata": { 670 | "id": "GmnSrXqtfRbq", 671 | "colab_type": "text" 672 | }, 673 | "cell_type": "markdown", 674 | "source": [ 675 | "### Conditional samples" 676 | ] 677 | }, 678 | { 679 | "metadata": { 680 | "id": "utJj-iY4gHwE", 681 | "colab_type": "code", 682 | "colab": {} 683 | }, 684 | "cell_type": "code", 685 | "source": [ 686 | "!python3 /content/gpt-2/src/interactive_conditional_samples.py --top_k=40 --nsamples=3 --temperature=0.7 --length=100" 687 | ], 688 | "execution_count": 0, 689 | "outputs": [] 690 | }, 691 | { 692 | "metadata": { 693 | "id": "K8rSqkGxg5OK", 694 | "colab_type": "text" 695 | }, 696 | "cell_type": "markdown", 697 | "source": [ 698 | "### Unconditional samples" 699 | ] 700 | }, 701 | { 702 | "metadata": { 703 | "id": "LaQUEnRxWc3c", 704 | "colab_type": "code", 705 | "colab": {} 706 | }, 707 | "cell_type": "code", 708 | "source": [ 709 | "!python3 src/generate_unconditional_samples.py | tee /tmp/samples" 710 | ], 711 | "execution_count": 0, 712 | "outputs": [] 713 | } 714 | ] 715 | } 716 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # DhammaBergGPT2 # 2 | 3 | Adaptation 
of https://github.com/ak9250/gpt-2-colab 4 | 5 | Allows re-training of OpenAI's GPT-2 117M Model in Google Colab. 6 | 7 | Includes code for building a single huge English training text file based on Project Gutenberg (books) or Access to Insight (Dharma texts). ATI seems to overfit quite rapidly since it's only 6 MB, but a 99 MB subset of Gutenberg came out quite convincing after only 4 hours of training. Your mileage may vary. 8 | 9 | ## Samples from Gutenberg99 ## 10 | 11 | Model prompt >>> I love you. 12 | ======================================== SAMPLE 2 ======================================== 13 | Do you love me?" 14 | 15 | She had just looked at him and smiled again. 16 | 17 | "Yes," she said. "I love you." 18 | 19 | She was in tears. 20 | 21 | "But I will not keep you long," he said. 22 | 23 | - - - 24 | 25 | Model prompt >>> The rain in Spain falls mainly in the plain. She sells seashells by the seashore. 26 | ======================================== SAMPLE 1 ======================================== 27 | The 28 | dwarf's eye, the fern, and the red-bearded beard have all been swept into 29 | the sunshine. These were the days when the rich and the poor were 30 | t 31 | ======================================== SAMPLE 2 ======================================== 32 | 33 | In the mountains there are many glaciers. We find them in the valleys of 34 | the Seychelles. They are so thin that even the best of the snow is 35 | overgrown with them 36 | ======================================== SAMPLE 3 ======================================== 37 | 38 | The city of St. Mark, in the north-west, and the city of Pampeluna are, 39 | in a general sense, the same, with a great number of inhabitants, 40 | mostly 41 | --------------------------------------------------------------------------------
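
## HTML-to-Text Cleaning ##

The ATI corpus is built by extracting the text of the `blockquote`, `h4`, and `p` elements from each HTML file, as the notebook does with BeautifulSoup. The same extraction step can be sketched with only Python's standard-library `html.parser` as a stand-in for the notebook's bs4 call; the class and function names below are illustrative and not part of the repo.

```python
# Minimal stand-in for the notebook's BeautifulSoup-based cleaner,
# using only the standard library. Names here are illustrative.
from html.parser import HTMLParser


class DhammaTextExtractor(HTMLParser):
    """Collects the text inside blockquote, h4, and p elements."""

    KEEP = {"blockquote", "h4", "p"}

    def __init__(self):
        super().__init__()
        self.depth = 0    # nesting depth inside kept elements
        self.chunks = []  # extracted text fragments

    def handle_starttag(self, tag, attrs):
        if tag in self.KEEP:
            self.depth += 1

    def handle_endtag(self, tag):
        if tag in self.KEEP and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        # Keep text only while inside at least one kept element.
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    parser = DhammaTextExtractor()
    parser.feed(html)
    return "\n".join(parser.chunks)


sample = "<html><body><h1>skip</h1><h4>Sutta</h4><p>Thus have I heard.</p></body></html>"
print(html_to_text(sample))  # prints "Sutta" then "Thus have I heard."
```

The bs4 version in the notebook is more robust to the malformed markup found in real-world HTML dumps; this sketch only demonstrates the element-selection logic.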