├── .gitignore ├── .DS_Store ├── assets ├── my-title │ ├── pred.png │ ├── ref.png │ ├── img2latex_task.svg │ └── seq2seq_vanilla_encoder.svg ├── seal-dark-red.png ├── tensorflow-model │ └── tensorboard.png ├── project-code-examples │ └── training.png ├── js │ └── header.js └── css │ ├── main.scss │ └── custom.css ├── _includes ├── icon-github.html ├── icon-twitter.html ├── image.html ├── google-analytics.html ├── double-image.html ├── icon-twitter.svg ├── post-preview.html ├── icon-github.svg ├── footer.html ├── header.html └── head.html ├── _layouts ├── page.html ├── default.html ├── post.html └── home.html ├── README.md ├── _config.yml ├── index.md ├── Gemfile ├── _sass ├── minima.scss └── minima │ ├── _syntax-highlighting.scss │ ├── _base.scss │ └── _layout.scss ├── Gemfile.lock └── _posts ├── 2018-01-10-my-title.md ├── 2018-02-01-aws-starter-guide.md ├── 2018-06-02-session-3.md ├── 2018-02-01-train-dev-test-split.md ├── 2018-02-01-logging-hyperparams.md ├── 2018-02-01-tensorflow-getting-started.md ├── 2018-06-02-session-4.md ├── 2018-02-01-pytorch-vision.md ├── 2018-02-01-pytorch-nlp.md ├── 2018-02-01-pytorch-getting-started.md ├── 2018-02-01-project-code-examples.md ├── 2018-02-01-tensorflow-model.md └── 2018-02-01-tensorflow-input-data.md /.gitignore: -------------------------------------------------------------------------------- 1 | _site 2 | .sass-cache 3 | .jekyll-metadata 4 | -------------------------------------------------------------------------------- /.DS_Store: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cs230-stanford/website-2018-winter/HEAD/.DS_Store -------------------------------------------------------------------------------- /assets/my-title/pred.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cs230-stanford/website-2018-winter/HEAD/assets/my-title/pred.png -------------------------------------------------------------------------------- /assets/my-title/ref.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cs230-stanford/website-2018-winter/HEAD/assets/my-title/ref.png -------------------------------------------------------------------------------- /assets/seal-dark-red.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cs230-stanford/website-2018-winter/HEAD/assets/seal-dark-red.png -------------------------------------------------------------------------------- /assets/tensorflow-model/tensorboard.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cs230-stanford/website-2018-winter/HEAD/assets/tensorflow-model/tensorboard.png -------------------------------------------------------------------------------- /assets/project-code-examples/training.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/cs230-stanford/website-2018-winter/HEAD/assets/project-code-examples/training.png -------------------------------------------------------------------------------- /_includes/icon-github.html: -------------------------------------------------------------------------------- 1 | {% include icon-github.svg %}{{include.username}} 2 | -------------------------------------------------------------------------------- /_includes/icon-twitter.html: 
-------------------------------------------------------------------------------- 1 | {% include icon-twitter.svg %}{{ include.username }} 2 | -------------------------------------------------------------------------------- /_layouts/page.html: -------------------------------------------------------------------------------- 1 | --- 2 | layout: default 3 | --- 4 |
5 | 6 |
7 |

{{ page.title | escape }}

8 |
9 | 10 |
11 | {{ content }} 12 |
13 | 14 |
15 | -------------------------------------------------------------------------------- /_includes/image.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 |
{{ include.description }}
{{ include.description }}
7 |

-------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Template for CS230 lecture notes on GitHub 2 | 3 | ## Installation 4 | 5 | Check out `https://jekyllrb.com/docs/quickstart/` 6 | 7 | ``` 8 | gem install jekyll bundler 9 | cd cs230-stanford.github.io 10 | bundle install 11 | ``` 12 | 13 | 14 | ## Run locally 15 | 16 | ``` 17 | bundle exec jekyll serve 18 | ``` -------------------------------------------------------------------------------- /assets/js/header.js: -------------------------------------------------------------------------------- 1 | function changeClass() { 2 | var element = document.getElementById("trigger"); 3 | if (element.className === "trigger") { 4 | element.className = "show"; 5 | } 6 | else { 7 | element.className = "trigger"; 8 | } 9 | } 10 | 11 | function hideClass() { 12 | var element = document.getElementById("trigger"); 13 | if (element.className === "show") { 14 | element.className = "trigger"; 15 | } 16 | } -------------------------------------------------------------------------------- /_layouts/default.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | {% include head.html %} 5 | {% seo %} 6 | 7 | 8 | 9 | {% include header.html %} 10 | 11 |
12 |
13 | {{ content }} 14 |
15 |
16 | 17 | {% include footer.html %} 18 | 19 | 20 | 21 | 22 | -------------------------------------------------------------------------------- /_includes/google-analytics.html: -------------------------------------------------------------------------------- 1 | 11 | 12 | -------------------------------------------------------------------------------- /_config.yml: -------------------------------------------------------------------------------- 1 | title: CS230 Deep Learning 2 | email: "" 3 | description: "Material for Stanford CS230" 4 | baseurl: "" # the subpath of your site, e.g. /blog 5 | url: "https://cs230-stanford.github.io" # the base hostname & protocol for your site, e.g. http://example.com 6 | github_username: cs230-stanford 7 | google_analytics: UA-114319548-1 8 | 9 | 10 | # Build settings 11 | permalink: none 12 | markdown: kramdown 13 | theme: minima 14 | gems: 15 | - jekyll-feed 16 | - jekyll-seo-tag 17 | exclude: 18 | - Gemfile 19 | - Gemfile.lock 20 | display-site-nav: false -------------------------------------------------------------------------------- /_includes/double-image.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 5 | 7 | 8 | 9 | 10 | 11 | 12 | 13 |
4 |
{{ include.caption1 }}
6 |
{{ include.caption2 }}
{{ include.caption1 }}{{ include.caption2 }}
{{ include.description }}
14 |

-------------------------------------------------------------------------------- /index.md: -------------------------------------------------------------------------------- 1 | --- 2 | # You don't need to edit this file, it's empty on purpose. 3 | # Edit theme's home layout instead if you wanna make some changes 4 | # See: https://jekyllrb.com/docs/themes/#overriding-theme-defaults 5 | layout: home 6 | --- 7 | 8 | 9 | These notes and tutorials are meant to complement the material of Stanford's class [CS230 (Deep Learning)](http://cs230.stanford.edu) taught by Prof. Andrew Ng and Prof. Kian Katanforoosh. For questions / typos / bugs, use Piazza. These posts and this github repository give an optional structure for your final projects. Feel free to reuse this code for your final project, although you are expected to accomplish a lot more. You can also submit a pull request directly to our [github](https://github.com/cs230-stanford/cs230-stanford.github.io). -------------------------------------------------------------------------------- /_includes/icon-twitter.svg: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /_includes/post-preview.html: -------------------------------------------------------------------------------- 1 | {% for post in site.posts %} 2 | {% if post.id == include.id %} 3 |
  • 4 | 5 | 13 |
    14 |
    {{ post.title | escape }}
    15 |
    {{ post.excerpt }}
    16 |
    17 | {% if post.github %} 18 |
    github 19 | {% endif %} 20 | 21 |
  • 22 | {% endif %} 23 | {% endfor %} -------------------------------------------------------------------------------- /_includes/icon-github.svg: -------------------------------------------------------------------------------- 1 | 2 | -------------------------------------------------------------------------------- /Gemfile: -------------------------------------------------------------------------------- 1 | source "https://rubygems.org" 2 | ruby RUBY_VERSION 3 | 4 | # Hello! This is where you manage which Jekyll version is used to run. 5 | # When you want to use a different version, change it below, save the 6 | # file and run `bundle install`. Run Jekyll with `bundle exec`, like so: 7 | # 8 | # bundle exec jekyll serve 9 | # 10 | # This will help ensure the proper Jekyll version is running. 11 | # Happy Jekylling! 12 | gem "jekyll", "3.4.3" 13 | 14 | # This is the default theme for new Jekyll sites. You may change this to anything you like. 15 | gem "minima", "~> 2.0" 16 | 17 | # If you want to use GitHub Pages, remove the "gem "jekyll"" above and 18 | # uncomment the line below. To upgrade, run `bundle update github-pages`. 19 | # gem "github-pages", group: :jekyll_plugins 20 | 21 | # If you have any plugins, put them here! 22 | group :jekyll_plugins do 23 | gem "jekyll-feed", "~> 0.6" 24 | end 25 | 26 | # Windows does not include zoneinfo files, so bundle the tzinfo-data gem 27 | gem 'tzinfo-data', platforms: [:mingw, :mswin, :x64_mingw, :jruby] 28 | 29 | gem 'jekyll-seo-tag' 30 | -------------------------------------------------------------------------------- /_layouts/post.html: -------------------------------------------------------------------------------- 1 | --- 2 | layout: default 3 | --- 4 |
    5 | 6 |
    7 |

    {{ page.title | escape }}

    8 |

    {{ page.excerpt | escape }}

    9 | 10 |
    11 | {% for tag in page.tags %} 12 |
    {{ tag }}
    13 | {% endfor %} 14 |
    15 | {% if page.github %} github {% endif %} 16 |
    17 | 18 |
    19 | {{ content }} 20 |
    21 | 22 | {% include disqus_comments.html %} 23 | 24 |
    -------------------------------------------------------------------------------- /assets/css/main.scss: -------------------------------------------------------------------------------- 1 | --- 2 | # Only the main Sass file needs front matter (the dashes are enough) 3 | --- 4 | @charset "utf-8"; 5 | 6 | // Our variables 7 | $base-font-family: "Helvetica Neue", Helvetica, Arial, sans-serif; 8 | $base-font-size: 16px; 9 | $base-font-weight: 400; 10 | $small-font-size: $base-font-size * 0.875; 11 | $base-line-height: 1.5; 12 | 13 | $spacing-unit: 30px; 14 | 15 | $text-color: #111; 16 | $background-color: #fdfdfd; 17 | $brand-color: #2a7ae2; 18 | 19 | $grey-color: #828282; 20 | $grey-color-light: lighten($grey-color, 40%); 21 | $grey-color-dark: darken($grey-color, 25%); 22 | 23 | // Width of the content area 24 | $content-width: 800px; 25 | 26 | $on-palm: 600px; 27 | $on-laptop: 800px; 28 | 29 | // Minima also includes a mixin for defining media queries. 30 | // Use media queries like this: 31 | // @include media-query($on-palm) { 32 | // .wrapper { 33 | // padding-right: $spacing-unit / 2; 34 | // padding-left: $spacing-unit / 2; 35 | // } 36 | // } 37 | 38 | // Import partials from the `minima` theme. 39 | @import "minima"; 40 | -------------------------------------------------------------------------------- /_sass/minima.scss: -------------------------------------------------------------------------------- 1 | // Define defaults for each variable. 2 | 3 | $base-font-family: "Helvetica Neue", Helvetica, Arial, sans-serif !default; 4 | $base-font-size: 16px !default; 5 | $base-font-weight: 400 !default; 6 | $small-font-size: $base-font-size * 0.875 !default; 7 | $base-line-height: 1.5 !default; 8 | 9 | $spacing-unit: 30px !default; 10 | 11 | $text-color: #111 !default; 12 | $background-color: #fdfdfd !default; 13 | $brand-color: #2a7ae2 !default; 14 | 15 | $grey-color: #828282 !default; 16 | $grey-color-light: lighten($grey-color, 40%) !default; 17 | $grey-color-dark: darken($grey-color, 25%) !default; 18 | 19 | // Width of the content area 20 | $content-width: 800px !default; 21 | 22 | $on-palm: 600px !default; 23 | $on-laptop: 800px !default; 24 | 25 | // Use media queries like this: 26 | // @include media-query($on-palm) { 27 | // .wrapper { 28 | // padding-right: $spacing-unit / 2; 29 | // padding-left: $spacing-unit / 2; 30 | // } 31 | // } 32 | @mixin media-query($device) { 33 | @media screen and (max-width: $device) { 34 | @content; 35 | } 36 | } 37 | 38 | // Import partials. 
39 | @import 40 | "minima/base", 41 | "minima/layout", 42 | "minima/syntax-highlighting" 43 | ; 44 | -------------------------------------------------------------------------------- /_includes/footer.html: -------------------------------------------------------------------------------- 1 | 47 | -------------------------------------------------------------------------------- /_includes/header.html: -------------------------------------------------------------------------------- 1 | 34 | -------------------------------------------------------------------------------- /Gemfile.lock: -------------------------------------------------------------------------------- 1 | GEM 2 | remote: https://rubygems.org/ 3 | specs: 4 | addressable (2.5.2) 5 | public_suffix (>= 2.0.2, < 4.0) 6 | colorator (1.1.0) 7 | ffi (1.9.18) 8 | forwardable-extended (2.6.0) 9 | jekyll (3.4.3) 10 | addressable (~> 2.4) 11 | colorator (~> 1.0) 12 | jekyll-sass-converter (~> 1.0) 13 | jekyll-watch (~> 1.1) 14 | kramdown (~> 1.3) 15 | liquid (~> 3.0) 16 | mercenary (~> 0.3.3) 17 | pathutil (~> 0.9) 18 | rouge (~> 1.7) 19 | safe_yaml (~> 1.0) 20 | jekyll-feed (0.9.2) 21 | jekyll (~> 3.3) 22 | jekyll-sass-converter (1.5.1) 23 | sass (~> 3.4) 24 | jekyll-seo-tag (2.4.0) 25 | jekyll (~> 3.3) 26 | jekyll-watch (1.5.1) 27 | listen (~> 3.0) 28 | kramdown (1.16.2) 29 | liquid (3.0.6) 30 | listen (3.1.5) 31 | rb-fsevent (~> 0.9, >= 0.9.4) 32 | rb-inotify (~> 0.9, >= 0.9.7) 33 | ruby_dep (~> 1.2) 34 | mercenary (0.3.6) 35 | minima (2.1.1) 36 | jekyll (~> 3.3) 37 | pathutil (0.16.1) 38 | forwardable-extended (~> 2.6) 39 | public_suffix (3.0.1) 40 | rb-fsevent (0.10.2) 41 | rb-inotify (0.9.10) 42 | ffi (>= 0.5.0, < 2) 43 | rouge (1.11.1) 44 | ruby_dep (1.5.0) 45 | safe_yaml (1.0.4) 46 | sass (3.5.5) 47 | sass-listen (~> 4.0.0) 48 | sass-listen (4.0.0) 49 | rb-fsevent (~> 0.9, >= 0.9.4) 50 | rb-inotify (~> 0.9, >= 0.9.7) 51 | 52 | PLATFORMS 53 | ruby 54 | 55 | DEPENDENCIES 56 | jekyll (= 3.4.3) 57 | jekyll-feed (~> 0.6) 58 | jekyll-seo-tag 59 | minima (~> 2.0) 60 | tzinfo-data 61 | 62 | RUBY VERSION 63 | ruby 2.4.3p205 64 | 65 | BUNDLED WITH 66 | 1.16.1 67 | -------------------------------------------------------------------------------- /_includes/head.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | 5 | 6 | {% if page.title %}{{ page.title | escape }}{% else %}{{ site.title | escape }}{% endif %} 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | {% if site.google_analytics %} 32 | {% include google-analytics.html %} 33 | {% endif %} 34 | 35 | {% if page.mathjax %} 36 | 41 | 44 | {% endif %} 45 | 46 | -------------------------------------------------------------------------------- /_posts/2018-01-10-my-title.md: -------------------------------------------------------------------------------- 1 | --- 2 | layout: post 3 | title: "My Title" 4 | description: "Description (less than 160 characters for seo)" 5 | excerpt: "Include here a short excerpt of the lecture Include here a short excerpt of the lecture" 6 | author: "Guillaume" 7 | date: 2018-01-10 8 | mathjax: true 9 | published: true 10 | tags: tensorflow NLP 11 | github: https://github.com/ 12 | module: Lectures 13 | --- 14 | 15 | 16 | 17 | Use markdow to write things. You can visit [the docs](https://github.com/jekyll/minima) for more information. Here is a summary of useful commands. To add a new note, create a new `md` file under the `_posts` directory. 
Images should be put into `assets/my-title`. 18 | 19 | ## heading h2 20 | ### heading h3 21 | #### heading h4 22 | 23 | Lists 24 | - first item 25 | - second item 26 | 27 | or 28 | 29 | 1. first element 30 | 10. second element 31 | 32 | 33 | You can use quotes 34 | 35 | > This is a quote 36 | 37 | 38 | For __bold__ and *italics*. 39 | 40 | 41 | To insert code use 42 | 43 | ```python 44 | def my_function(x): 45 | return x 46 | 47 | ``` 48 | 49 | 50 | To use maths inline $ (x + y)^2 $ or 51 | 52 | $$ (x + y)^2 $$ 53 | 54 | 55 | You can include images (place your images in the `assets/my-title/` directory) with 56 | 57 | {% include image.html url="/assets/my-title/img2latex_encoder.svg" description="Convolutional Encoder - produces a sequence of vectors" size="80%" %} 58 | 59 | 60 | You can also include double images 61 | 62 | 63 | {% include double-image.html 64 | url1="/assets/my-title/img2latex_encoder.svg" caption1="caption 1" 65 | url2="/assets/my-title/img2latex_encoder.svg" caption2="caption 2" 66 | size="70%" 67 | description="Global caption" %} 68 | 69 | 70 | 71 | To create beautiful illustrations I recommend using the `svg` format and the webiste [lucidchart](https://www.lucidchart.com). 72 | 73 | 74 | You can include html like 75 | 76 |

    > Go home

    77 | -------------------------------------------------------------------------------- /_posts/2018-02-01-aws-starter-guide.md: -------------------------------------------------------------------------------- 1 | --- 2 | layout: post 3 | title: "AWS setup" 4 | description: "How to set up AWS for deep learning projects" 5 | excerpt: "How to set up AWS for deep learning projects" 6 | author: "Teaching assistant Russell Kaplan" 7 | date: 2018-01-24 8 | mathjax: true 9 | published: true 10 | tags: tensorflow pytorch 11 | github: https://github.com/cs230-stanford/cs230-code-examples 12 | module: Tutorials 13 | --- 14 | 15 | __Table of Content__ 16 | 17 | * TOC 18 | {:toc} 19 | 20 | 21 | --- 22 | 23 | ### Get your AWS credits 24 | 25 | For this winter 2018 session, AWS is offering GPU credits for CS230 students. If no one on your team has requested AWS credit yet, please follow the instructions on the AWS piazza post to get your credits. 26 | 27 | ### Create a Deep Learning EC2 instance 28 | 29 | Follow Amazon's [getting started guide][aws-tutorial] for creating a __Deep Learning instance__. Be sure to pick the Ubuntu version of the deep learning Amazon Machine Images (AMI) at the third screen. For the instance type, we recommend using __p2.xlarge__. This is available in the __US East (Northern Virginia)__ region (it's not available in Northern California). Follow the instructions to SSH into the instance. 30 | 31 | **IMPORTANT**: Be sure to not forget to **turn off your instance** when you are not using it! If you leave it running, you will be billed continuously for the hours it is left on and you will run out of credits very quickly. 32 | 33 | 34 | 35 | ### Clone the project code examples 36 | 37 | It's not required to base your project on the Project Code Examples, but it might be helpful. (Some of you might be using existing code from another GitHub repo instead, for example.) 38 | For an introduction to the code examples, see [our tutorial][post-1]. To clone, run this command inside your SSH session with the amazon server: 39 | ``` 40 | git clone https://github.com/cs230-stanford/cs230-code-examples.git 41 | ``` 42 | 43 | 44 | ### Start training 45 | 46 | You're ready to start training! Follow the instructions in the [project tutorial][post-1] to start training a model. We're excited about the amazing things you will build! 47 | 48 | 49 | 50 | [post-1]: https://cs230-stanford.github.io/project-code-examples.html 51 | [aws-tutorial]: https://aws.amazon.com/blogs/machine-learning/get-started-with-deep-learning-using-the-aws-deep-learning-ami/ 52 | -------------------------------------------------------------------------------- /_layouts/home.html: -------------------------------------------------------------------------------- 1 | 2 | 3 | 4 | {% include head.html %} 5 | {% seo %} 6 | 7 | 8 | 9 | {% include header.html %} 10 | 11 |
    12 |
    13 |
    14 | 15 |
    16 | {{ content }} 17 |
    18 | 19 | 20 | 26 | 27 | 28 | 29 | 30 | 31 |
    32 | Hands-on sessions 33 |
    34 | 35 |
      36 | {% include post-preview.html id="/session-3" %} 37 | {% include post-preview.html id="/session-4" %} 38 |
    39 | 40 |
    41 | Final Project 42 |
    43 |
    44 | Introduction 45 |
    46 |
      47 | {% include post-preview.html id="/project-code-examples" %} 48 | {% include post-preview.html id="/aws-starter-guide" %} 49 |
    50 | 51 |
    52 | Best practices 53 |
    54 |
      55 | {% include post-preview.html id="/train-dev-test-split" %} 56 | {% include post-preview.html id="/logging-hyperparams" %} 57 |
    58 | 59 |
    60 | TensorFlow 61 |
    62 |
      63 | {% include post-preview.html id="/tensorflow-getting-started" %} 64 | {% include post-preview.html id="/tensorflow-input-data" %} 65 | {% include post-preview.html id="/tensorflow-model" %} 66 |
    67 | 68 |
    69 | PyTorch 70 |
    71 |
      72 | {% include post-preview.html id="/pytorch-getting-started" %} 73 | {% include post-preview.html id="/pytorch-vision" %} 74 | {% include post-preview.html id="/pytorch-nlp" %} 75 |
    76 | 77 |
    78 | 79 |
    80 |
    81 | 82 | {% include footer.html %} 83 | 84 | 85 | 86 | 87 | 88 | 89 | -------------------------------------------------------------------------------- /assets/css/custom.css: -------------------------------------------------------------------------------- 1 | body { 2 | /* font-family: Helvetica, Arial, sans-serif; */ 3 | font-family: 'Roboto', sans-serif; 4 | font-size: 16px; 5 | line-height: 1.5; 6 | font-weight: 300; 7 | background-color: #fdfdfd; 8 | 9 | display: flex; 10 | min-height: 100vh; 11 | flex-direction: column; 12 | } 13 | 14 | .post { 15 | padding-top: 30px; 16 | } 17 | 18 | .home-page-content { 19 | padding-bottom: 20px; 20 | } 21 | 22 | .center-image 23 | { 24 | margin: auto; 25 | display: block; 26 | max-width: 70%; 27 | } 28 | 29 | p { 30 | text-align: justify; 31 | } 32 | 33 | li { 34 | text-align: justify; 35 | } 36 | 37 | li.post-preview { 38 | margin-bottom: 10px; 39 | overflow: hidden; 40 | padding-bottom: 10px; 41 | border-bottom:1px solid rgba(0,0,0,0.2); 42 | } 43 | 44 | li:last-child { 45 | border-bottom: none; 46 | } 47 | 48 | .title-wrapper { 49 | text-align: center; 50 | } 51 | 52 | a.post-preview { 53 | color: inherit; 54 | margin: 0px 0px; 55 | margin-right: 5px; 56 | 57 | } 58 | 59 | a.post-preview:hover div.post-preview-title { 60 | color: #B70B14; 61 | text-decoration: none; 62 | font-weight: normal; 63 | } 64 | 65 | 66 | .module-h1 { 67 | font-weight: 600; 68 | font-size: 32px; 69 | text-align: center; 70 | } 71 | 72 | .module-h2 { 73 | font-weight: 300; 74 | font-size: 25px; 75 | color: #B70B14; 76 | text-align: center; 77 | margin: 10px; 78 | } 79 | 80 | .post-preview-title { 81 | font-weight: 700; 82 | font-size: 24px; 83 | } 84 | 85 | .post-metadata { 86 | display: inline-block; 87 | float: left; 88 | padding-top: 0.5em; 89 | padding-right: 20px; 90 | 91 | } 92 | 93 | .post-date { 94 | padding-bottom: 5px; 95 | } 96 | 97 | .tags { 98 | display: inline-block; 99 | } 100 | 101 | .tag { 102 | color: rgba(0,0,0,0.67); 103 | padding: 0.1em 0.5em; 104 | margin: 0; 105 | font-size: 80%; 106 | border: 1px solid rgba(0,0,0,0.4); 107 | border-radius: 5px; 108 | margin-bottom: 5px; 109 | margin-right: 5px; 110 | display: inline-block; 111 | float: left; 112 | } 113 | 114 | @media(min-width: 768px) { 115 | .post-metadata { 116 | width: 20%; 117 | } 118 | 119 | .post-github { 120 | float: right; 121 | } 122 | 123 | .post-description { 124 | width: 65%; 125 | } 126 | } 127 | 128 | .post-description { 129 | position: relative; 130 | text-align: left; 131 | display: inline-block; 132 | vertical-align: top; 133 | } 134 | 135 | .post-github { 136 | background-color: #E1E4FF; 137 | display: inline-block; 138 | font-size: 80%; 139 | margin-top: 10px; 140 | padding: 0.1em 0.5em; 141 | border-radius: 15px; 142 | margin-bottom: 5px; 143 | border: 1px solid rgba(0,0,0,0.4); 144 | float: right; 145 | } 146 | 147 | a.post-github { 148 | color: inherit; 149 | } 150 | 151 | a.post-github:hover { 152 | color: inherit; 153 | background-color: #C7EEAA; 154 | text-decoration: none; 155 | } 156 | 157 | 158 | 159 | .site-header { 160 | background-color: #B70B14; 161 | } 162 | 163 | .site-title { 164 | color: #FFF !important; 165 | float: none; 166 | } 167 | 168 | .page-link { 169 | color: #bbb !important; 170 | } 171 | 172 | .page-content { 173 | flex: 1; 174 | } 175 | 176 | .site-footer { 177 | background-color: #EEF1F5; 178 | } 179 | -------------------------------------------------------------------------------- /_sass/minima/_syntax-highlighting.scss: 
-------------------------------------------------------------------------------- 1 | /** 2 | * Syntax highlighting styles 3 | */ 4 | .highlight { 5 | background: #fff; 6 | @extend %vertical-rhythm; 7 | 8 | .highlighter-rouge & { 9 | background: #eef; 10 | } 11 | 12 | .c { color: #998; font-style: italic } // Comment 13 | .err { color: #a61717; background-color: #e3d2d2 } // Error 14 | .k { font-weight: bold } // Keyword 15 | .o { font-weight: bold } // Operator 16 | .cm { color: #998; font-style: italic } // Comment.Multiline 17 | .cp { color: #999; font-weight: bold } // Comment.Preproc 18 | .c1 { color: #998; font-style: italic } // Comment.Single 19 | .cs { color: #999; font-weight: bold; font-style: italic } // Comment.Special 20 | .gd { color: #000; background-color: #fdd } // Generic.Deleted 21 | .gd .x { color: #000; background-color: #faa } // Generic.Deleted.Specific 22 | .ge { font-style: italic } // Generic.Emph 23 | .gr { color: #a00 } // Generic.Error 24 | .gh { color: #999 } // Generic.Heading 25 | .gi { color: #000; background-color: #dfd } // Generic.Inserted 26 | .gi .x { color: #000; background-color: #afa } // Generic.Inserted.Specific 27 | .go { color: #888 } // Generic.Output 28 | .gp { color: #555 } // Generic.Prompt 29 | .gs { font-weight: bold } // Generic.Strong 30 | .gu { color: #aaa } // Generic.Subheading 31 | .gt { color: #a00 } // Generic.Traceback 32 | .kc { font-weight: bold } // Keyword.Constant 33 | .kd { font-weight: bold } // Keyword.Declaration 34 | .kp { font-weight: bold } // Keyword.Pseudo 35 | .kr { font-weight: bold } // Keyword.Reserved 36 | .kt { color: #458; font-weight: bold } // Keyword.Type 37 | .m { color: #099 } // Literal.Number 38 | .s { color: #d14 } // Literal.String 39 | .na { color: #008080 } // Name.Attribute 40 | .nb { color: #0086B3 } // Name.Builtin 41 | .nc { color: #458; font-weight: bold } // Name.Class 42 | .no { color: #008080 } // Name.Constant 43 | .ni { color: #800080 } // Name.Entity 44 | .ne { color: #900; font-weight: bold } // Name.Exception 45 | .nf { color: #900; font-weight: bold } // Name.Function 46 | .nn { color: #555 } // Name.Namespace 47 | .nt { color: #000080 } // Name.Tag 48 | .nv { color: #008080 } // Name.Variable 49 | .ow { font-weight: bold } // Operator.Word 50 | .w { color: #bbb } // Text.Whitespace 51 | .mf { color: #099 } // Literal.Number.Float 52 | .mh { color: #099 } // Literal.Number.Hex 53 | .mi { color: #099 } // Literal.Number.Integer 54 | .mo { color: #099 } // Literal.Number.Oct 55 | .sb { color: #d14 } // Literal.String.Backtick 56 | .sc { color: #d14 } // Literal.String.Char 57 | .sd { color: #d14 } // Literal.String.Doc 58 | .s2 { color: #d14 } // Literal.String.Double 59 | .se { color: #d14 } // Literal.String.Escape 60 | .sh { color: #d14 } // Literal.String.Heredoc 61 | .si { color: #d14 } // Literal.String.Interpol 62 | .sx { color: #d14 } // Literal.String.Other 63 | .sr { color: #009926 } // Literal.String.Regex 64 | .s1 { color: #d14 } // Literal.String.Single 65 | .ss { color: #990073 } // Literal.String.Symbol 66 | .bp { color: #999 } // Name.Builtin.Pseudo 67 | .vc { color: #008080 } // Name.Variable.Class 68 | .vg { color: #008080 } // Name.Variable.Global 69 | .vi { color: #008080 } // Name.Variable.Instance 70 | .il { color: #099 } // Literal.Number.Integer.Long 71 | } 72 | -------------------------------------------------------------------------------- /_sass/minima/_base.scss: -------------------------------------------------------------------------------- 1 | /** 2 | * 
Reset some basic elements 3 | */ 4 | body, h1, h2, h3, h4, h5, h6, 5 | p, blockquote, pre, hr, 6 | dl, dd, ol, ul, figure { 7 | margin: 0; 8 | padding: 0; 9 | } 10 | 11 | 12 | 13 | /** 14 | * Basic styling 15 | */ 16 | body { 17 | font: $base-font-weight #{$base-font-size}/#{$base-line-height} $base-font-family; 18 | color: $text-color; 19 | background-color: $background-color; 20 | -webkit-text-size-adjust: 100%; 21 | -webkit-font-feature-settings: "kern" 1; 22 | -moz-font-feature-settings: "kern" 1; 23 | -o-font-feature-settings: "kern" 1; 24 | font-feature-settings: "kern" 1; 25 | font-kerning: normal; 26 | } 27 | 28 | 29 | 30 | /** 31 | * Set `margin-bottom` to maintain vertical rhythm 32 | */ 33 | h1, h2, h3, h4, h5, h6, 34 | p, blockquote, pre, 35 | ul, ol, dl, figure, 36 | %vertical-rhythm { 37 | margin-bottom: $spacing-unit / 2; 38 | } 39 | 40 | 41 | 42 | /** 43 | * Images 44 | */ 45 | img { 46 | max-width: 100%; 47 | vertical-align: middle; 48 | } 49 | 50 | 51 | 52 | /** 53 | * Figures 54 | */ 55 | figure > img { 56 | display: block; 57 | } 58 | 59 | figcaption { 60 | font-size: $small-font-size; 61 | } 62 | 63 | 64 | 65 | /** 66 | * Lists 67 | */ 68 | ul, ol { 69 | margin-left: $spacing-unit; 70 | } 71 | 72 | li { 73 | > ul, 74 | > ol { 75 | margin-bottom: 0; 76 | } 77 | } 78 | 79 | 80 | 81 | /** 82 | * Headings 83 | */ 84 | h1, h2, h3, h4, h5, h6 { 85 | font-weight: $base-font-weight; 86 | } 87 | 88 | 89 | 90 | /** 91 | * Links 92 | */ 93 | a { 94 | color: $brand-color; 95 | text-decoration: none; 96 | 97 | &:visited { 98 | color: darken($brand-color, 15%); 99 | } 100 | 101 | &:hover { 102 | color: $text-color; 103 | text-decoration: underline; 104 | } 105 | } 106 | 107 | 108 | 109 | /** 110 | * Blockquotes 111 | */ 112 | blockquote { 113 | color: $grey-color; 114 | border-left: 4px solid $grey-color-light; 115 | padding-left: $spacing-unit / 2; 116 | font-size: 18px; 117 | letter-spacing: -1px; 118 | font-style: italic; 119 | 120 | > :last-child { 121 | margin-bottom: 0; 122 | } 123 | } 124 | 125 | 126 | 127 | /** 128 | * Code formatting 129 | */ 130 | pre, 131 | code { 132 | font-size: 15px; 133 | border: 1px solid $grey-color-light; 134 | border-radius: 3px; 135 | background-color: #eef; 136 | } 137 | 138 | code { 139 | padding: 1px 5px; 140 | } 141 | 142 | pre { 143 | padding: 8px 12px; 144 | overflow-x: auto; 145 | 146 | > code { 147 | border: 0; 148 | padding-right: 0; 149 | padding-left: 0; 150 | } 151 | } 152 | 153 | 154 | 155 | /** 156 | * Wrapper 157 | */ 158 | .wrapper { 159 | max-width: -webkit-calc(#{$content-width} - (#{$spacing-unit} * 2)); 160 | max-width: calc(#{$content-width} - (#{$spacing-unit} * 2)); 161 | margin-right: auto; 162 | margin-left: auto; 163 | padding-right: $spacing-unit; 164 | padding-left: $spacing-unit; 165 | @extend %clearfix; 166 | 167 | @include media-query($on-laptop) { 168 | max-width: -webkit-calc(#{$content-width} - (#{$spacing-unit})); 169 | max-width: calc(#{$content-width} - (#{$spacing-unit})); 170 | padding-right: $spacing-unit / 2; 171 | padding-left: $spacing-unit / 2; 172 | } 173 | } 174 | 175 | 176 | 177 | /** 178 | * Clearfix 179 | */ 180 | %clearfix:after { 181 | content: ""; 182 | display: table; 183 | clear: both; 184 | } 185 | 186 | 187 | 188 | /** 189 | * Icons 190 | */ 191 | .icon > svg { 192 | display: inline-block; 193 | vertical-align: middle; 194 | 195 | path { 196 | fill: $grey-color; 197 | } 198 | } 199 | -------------------------------------------------------------------------------- 
/_posts/2018-06-02-session-3.md: -------------------------------------------------------------------------------- 1 | --- 2 | layout: post 3 | title: "Introduction to Tensorflow" 4 | description: "" 5 | excerpt: "Session 3" 6 | author: "Kian Katanforoosh, Andrew Ng" 7 | date: 2018-06-02 8 | mathjax: true 9 | published: true 10 | tags: tensorflow 11 | github: https://github.com/cs230-stanford/cs230-code-examples/tree/master/tensorflow 12 | module: Tutorials 13 | --- 14 | 15 | In this hands-on session, you will use two files: 16 | - Tensorflow_tutorial.py (Part I) 17 | - [CS230 project example code](https://github.com/cs230-stanford/cs230-code-examples) repository on github (Part II) 18 | 19 | ## Part I - Tensorflow Tutorial 20 | 21 | The goal of this part is to quickly build a tensorflow code implementing a Neural Network to classify hand digits from the MNIST dataset. 22 | 23 | The steps you are going to implement are: 24 | - Load the dataset 25 | - Define placeholders 26 | - Define parameters of your model 27 | - Define the model’s graph (including the cost function) 28 | - Define your accuracy metric 29 | - Define the optimization method and the training step 30 | - Initialize the tensorflow graph 31 | - Optimize (loop) 32 | - Compute training and testing accuracies 33 | 34 | **Question 1:** ​Open the starter code “tensorflow_tutorial.py”. Tensorflow stores the MNIST dataset in one of its dependencies called “tensorflow.examples.tutorials.mnist”. This part is very specific to MNIST so we have coded it for you. Please read the code that loads MNIST. 35 | 36 | **Question 2:** Define the tensorflow placeholders X (data) and Y (labels). Recall that the data is stored in 28x28 grayscale images, and the labels are between 0 and 9 37 | 38 | **Question 3:** For now, we are going to implement a very simple 2-layer neural network *(LINEAR->RELU->LINEAR->SOFTMAX)*. Define the parameters of your model in tensorflow. Make sure your shapes match. 39 | 40 | **Question 4:** Using the parameters defined in question (3), implement the forward propagation (from the input X to the output probabilities A). Don’t forget to reshape your input, as you are using a fully-connected neural network. 41 | 42 | **Question 5:** Recall that this is a 10-class classification task. What cost function should you use? Implement your cost function. 43 | 44 | **Question 6:** What accuracy metric should you use? Implement your accuracy metric. 45 | 46 | **Question 7:** Define the tensorflow optimizer you want to use, and the tensorflow training step. Running the training step in the tensorflow graph will perform one optimization step. 47 | 48 | **Question 8:** As usual in tensorflow, you need to initialize the variables of the graph, create the tensorflow session and run the initializer on the session. Write code to do these steps. 49 | 50 | **Question 9:** Implement the optimization loop for 20,000 steps. At every step, have to: 51 | - Load the mini-batch of MNIST data (including images and labels) 52 | - Create a feed dictionary to assign your placeholders to the data. 53 | - Run the session defined above on the correct graph nodes to perform an optimization step and access the desired values of the graph. 54 | - Print the cost and iteration number. 55 | 56 | **Question 10:** Using your accuracy metric, compute the accuracy and the value of the cost function both on the train and test set. 
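If you want to check your work as you go, here is one possible shape the full pipeline can take. This is only a rough sketch written for this post, not the official starter code or solution: the hidden layer size (64 units), the optimizer, the batch size and all variable names are arbitrary choices, and it assumes the old-style placeholder/Session API of TensorFlow 1.x, where `read_data_sets` already returns images flattened to vectors of size 784.

```python
import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

# Load MNIST (images come back flattened to 784-dim vectors, labels are one-hot)
mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)

# Q2: placeholders for the images and the labels
X = tf.placeholder(tf.float32, [None, 784])
Y = tf.placeholder(tf.float32, [None, 10])

# Q3: parameters of a 2-layer network (LINEAR -> RELU -> LINEAR -> SOFTMAX)
W1 = tf.Variable(tf.random_normal([784, 64], stddev=0.1))
b1 = tf.Variable(tf.zeros([64]))
W2 = tf.Variable(tf.random_normal([64, 10], stddev=0.1))
b2 = tf.Variable(tf.zeros([10]))

# Q4: forward propagation, from the input X to the logits
hidden = tf.nn.relu(tf.matmul(X, W1) + b1)
logits = tf.matmul(hidden, W2) + b2

# Q5: softmax cross-entropy cost for a 10-class classification task
cost = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=Y, logits=logits))

# Q6: accuracy metric
correct = tf.equal(tf.argmax(logits, 1), tf.argmax(Y, 1))
accuracy = tf.reduce_mean(tf.cast(correct, tf.float32))

# Q7: optimizer and training step
train_step = tf.train.AdamOptimizer(learning_rate=0.001).minimize(cost)

# Q8: create the session and initialize the variables
sess = tf.Session()
sess.run(tf.global_variables_initializer())

# Q9: optimization loop
for i in range(20000):
    batch_X, batch_Y = mnist.train.next_batch(64)
    _, c = sess.run([train_step, cost], feed_dict={X: batch_X, Y: batch_Y})
    if i % 1000 == 0:
        print("Iteration {}, cost = {:.4f}".format(i, c))

# Q10: accuracy and cost on the train and test sets
print(sess.run([accuracy, cost], feed_dict={X: mnist.train.images, Y: mnist.train.labels}))
print(sess.run([accuracy, cost], feed_dict={X: mnist.test.images, Y: mnist.test.labels}))
```

The actual starter code structures things differently, so treat this only as a reference point for the shapes and graph nodes involved.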
57 | 58 | Run the code from your terminal using: *"python tensorflow_tutorial.py"* 59 | 60 | **Question 11:** Look at the outputs, accuracy and logs of your model. What improvements could be made? Take time at home to play with your code, and search for ideas online. 61 | 62 | ## Part II - Project Code Examples 63 | 64 | The goal of this part is to become more familiar with the [CS230 project example code](https://github.com/cs230-stanford/cs230-code-examples) that the 65 | teaching staff has provided. It's meant to help you prototype ideas for your projects. 66 | 67 | **Question 1:** Please start by git cloning the "[cs230-code-examples](https://github.com/cs230-stanford/cs230-code-examples)" repository on your local computer. 68 | 69 | This repository contains several project examples: 70 | - Tensorflow vision (SIGNS dataset classification) 71 | - Tensorflow nlp (Named Entity Recognition) 72 | - Pytorch vision (SIGNS dataset classification) 73 | - Pytorch nlp (Named Entity Recognition) 74 | 75 | **Question 2:** You will start with the Tensorflow vision project example. In your terminal, inside the cloned repository, navigate to *./tensorflow/vision*. Follow the instructions described in the "Requirements" section of the readme. 76 | 77 | **Question 3:** Follow the guidelines described in the "Download the SIGNS dataset" section. 78 | 79 | **Question 4:** Follow the guidelines described in the "Quickstart" section. 80 | 81 | **Question 5:** Read through the code and find what you could modify. All the project examples are structured the same way; we invite you to try the one that best matches your needs. This project example code is designed to help you plug in your dataset quickly and start prototyping results. It is meant as helper code for your project and as a way to learn TensorFlow/PyTorch in depth; it is not starter code.
82 | 83 | 84 | 85 | 86 | -------------------------------------------------------------------------------- /_sass/minima/_layout.scss: -------------------------------------------------------------------------------- 1 | /** 2 | * Site header 3 | */ 4 | .site-header { 5 | /*border-top: 5px solid $grey-color-dark;*/ 6 | border-bottom: 1px solid $grey-color-light; 7 | min-height: 56px; 8 | 9 | // Positioning context for the mobile navigation icon 10 | position: relative; 11 | } 12 | 13 | .site-title { 14 | font-size: 26px; 15 | font-weight: 300; 16 | line-height: 56px; 17 | letter-spacing: -1px; 18 | margin-bottom: 0; 19 | float: left; 20 | 21 | &, 22 | &:visited { 23 | color: $grey-color-dark; 24 | } 25 | } 26 | 27 | .site-nav { 28 | float: right; 29 | line-height: 56px; 30 | 31 | .menu-icon { 32 | display: none; 33 | } 34 | 35 | .page-link { 36 | color: $text-color; 37 | line-height: $base-line-height; 38 | 39 | // Gaps between nav items, but not on the last one 40 | &:not(:last-child) { 41 | margin-right: 20px; 42 | } 43 | } 44 | 45 | @include media-query($on-palm) { 46 | position: absolute; 47 | top: 9px; 48 | right: $spacing-unit / 2; 49 | background-color: $background-color; 50 | border: 1px solid $grey-color-light; 51 | border-radius: 5px; 52 | text-align: right; 53 | 54 | .menu-icon { 55 | display: block; 56 | float: right; 57 | width: 36px; 58 | height: 26px; 59 | line-height: 0; 60 | padding-top: 10px; 61 | text-align: center; 62 | 63 | > svg path { 64 | fill: $grey-color-dark; 65 | } 66 | } 67 | 68 | .trigger { 69 | clear: both; 70 | display: none; 71 | } 72 | 73 | .show { 74 | display: block; 75 | padding-bottom: 5px; 76 | } 77 | 78 | 79 | 80 | .page-link { 81 | display: block; 82 | padding: 5px 10px; 83 | 84 | &:not(:last-child) { 85 | margin-right: 0; 86 | } 87 | margin-left: 20px; 88 | } 89 | } 90 | } 91 | 92 | 93 | 94 | /** 95 | * Site footer 96 | */ 97 | .site-footer { 98 | border-top: 1px solid $grey-color-light; 99 | padding: $spacing-unit 0; 100 | } 101 | 102 | .footer-heading { 103 | font-size: 18px; 104 | margin-bottom: $spacing-unit / 2; 105 | } 106 | 107 | .contact-list, 108 | .social-media-list { 109 | list-style: none; 110 | margin-left: 0; 111 | } 112 | 113 | .footer-col-wrapper { 114 | font-size: 15px; 115 | color: $grey-color; 116 | margin-left: -$spacing-unit / 2; 117 | @extend %clearfix; 118 | } 119 | 120 | .footer-col { 121 | float: left; 122 | margin-bottom: $spacing-unit / 2; 123 | padding-left: $spacing-unit / 2; 124 | } 125 | 126 | .footer-col-1 { 127 | width: -webkit-calc(35% - (#{$spacing-unit} / 2)); 128 | width: calc(35% - (#{$spacing-unit} / 2)); 129 | } 130 | 131 | .footer-col-2 { 132 | width: -webkit-calc(30% - (#{$spacing-unit} / 2)); 133 | width: calc(30% - (#{$spacing-unit} / 2)); 134 | text-align: center; 135 | } 136 | 137 | .footer-col-3 { 138 | width: -webkit-calc(35% - (#{$spacing-unit} / 2)); 139 | width: calc(35% - (#{$spacing-unit} / 2)); 140 | text-align: right; 141 | } 142 | 143 | @media only screen and (max-width: 768px) { 144 | .footer-col-2 { 145 | text-align: left; 146 | } 147 | .footer-col-3 { 148 | text-align: left; 149 | } 150 | } 151 | 152 | @include media-query($on-laptop) { 153 | .footer-col-1, 154 | .footer-col-2 { 155 | width: -webkit-calc(50% - (#{$spacing-unit} / 2)); 156 | width: calc(50% - (#{$spacing-unit} / 2)); 157 | } 158 | 159 | .footer-col-3 { 160 | width: -webkit-calc(100% - (#{$spacing-unit} / 2)); 161 | width: calc(100% - (#{$spacing-unit} / 2)); 162 | } 163 | } 164 | 165 | @include media-query($on-palm) { 166 | 
.footer-col { 167 | float: none; 168 | width: -webkit-calc(100% - (#{$spacing-unit} / 2)); 169 | width: calc(100% - (#{$spacing-unit} / 2)); 170 | } 171 | } 172 | 173 | 174 | 175 | /** 176 | * Page content 177 | */ 178 | .page-content { 179 | padding: $spacing-unit 0; 180 | } 181 | 182 | .page-heading { 183 | font-size: 20px; 184 | } 185 | 186 | .post-list { 187 | margin-left: 0; 188 | list-style: none; 189 | 190 | > li { 191 | margin-bottom: $spacing-unit; 192 | } 193 | } 194 | 195 | .post-meta { 196 | font-size: $small-font-size; 197 | color: $grey-color; 198 | } 199 | 200 | .post-link { 201 | display: block; 202 | font-size: 24px; 203 | } 204 | 205 | 206 | 207 | /** 208 | * Posts 209 | */ 210 | .post-header { 211 | margin-bottom: $spacing-unit; 212 | } 213 | 214 | .post-title { 215 | font-size: 42px; 216 | letter-spacing: -1px; 217 | line-height: 1; 218 | 219 | @include media-query($on-laptop) { 220 | font-size: 36px; 221 | } 222 | } 223 | 224 | .post-content { 225 | margin-bottom: $spacing-unit; 226 | 227 | h2 { 228 | font-size: 32px; 229 | 230 | @include media-query($on-laptop) { 231 | font-size: 28px; 232 | } 233 | } 234 | 235 | h3 { 236 | font-size: 26px; 237 | 238 | @include media-query($on-laptop) { 239 | font-size: 22px; 240 | } 241 | } 242 | 243 | h4 { 244 | font-size: 20px; 245 | 246 | @include media-query($on-laptop) { 247 | font-size: 18px; 248 | } 249 | } 250 | } 251 | -------------------------------------------------------------------------------- /_posts/2018-02-01-train-dev-test-split.md: -------------------------------------------------------------------------------- 1 | --- 2 | layout: post 3 | title: "Splitting into train, dev and test sets" 4 | description: "Short tutorial detailing the best practices to split your dataset into train, dev and test sets" 5 | excerpt: "Best practices to split your dataset into train, dev and test sets" 6 | author: "Teaching assistants Olivier Moindrot and Guillaume Genthial" 7 | date: 2018-01-24 8 | mathjax: true 9 | published: true 10 | tags: best-practice 11 | github: https://github.com/cs230-stanford/cs230-code-examples 12 | module: Tutorials 13 | --- 14 | 15 | Splitting your data into training, dev and test sets can be disastrous if not done correctly. 16 | In this short tutorial, we will explain the best practices when splitting your dataset. 17 | 18 | This post follows part 3 of the class on ["Structuring your Machine Learning Project"][coursera], and adds code examples to the theoretical content. 19 | 20 | This tutorial is among a series of tutorials explaining how to structure a deep learning project. Please see the full list of posts on the [main page][main]. 21 | 22 | __Table of Content__ 23 | 24 | * TOC 25 | {:toc} 26 | 27 | 28 | --- 29 | ## Theory: how to choose the train, train-dev, dev and test sets 30 | 31 | _Please refer to the [course content][coursera] for a full overview._ 32 | 33 | Setting up the training, development (dev) and test sets has a huge impact on productivity. It is important to choose the dev and test sets from the __same distribution__ and it must be taken randomly from all the data. 34 | 35 | __Guideline__: Choose a dev set and test set to reflect data you expect to get in the future. 36 | 37 | The size of the dev and test set should be big enough for the dev and test results to be representative of the performance of the model. If the dev set has 100 examples, the dev accuracy can vary a lot depending on the chosen dev set. 
For bigger datasets (>1M examples), the dev and test set can have around 10,000 examples each for instance (only 1% of the total data). 38 | 39 | __Guideline__: The dev and test sets should be just big enough to represent accurately the performance of the model 40 | 41 | If the training set and dev sets have different distributions, it is good practice to introduce a __train-dev set__ that has the same distribution as the training set. This train-dev set will be used to measure how much the model is overfitting. Again, refer to the [course content][coursera] for a full overview. 42 | 43 | 44 | ### Objectives in practice 45 | 46 | These guidelines translate into best practices for code: 47 | 48 | - the split between train / dev / test should __always be the same__ across experiments 49 | - otherwise, different models are not evaluated in the same conditions 50 | - we should have a __reproducible script__ to create the train / dev / test split 51 | - we need to test if the __dev__ and __test__ sets should come from the same distribution 52 | 53 | 54 | --- 55 | ## Have a reproducible script 56 | 57 | The best and most secure way to split the data into these three sets is to have one directory for train, one for dev and one for test. 58 | 59 | For instance if you have a dataset of images, you could have a structure like this with 80% in the training set, 10% in the dev set and 10% in the test set. 60 | ``` 61 | data/ 62 | train/ 63 | img_000.jpg 64 | ... 65 | img_799.jpg 66 | dev/ 67 | img_800.jpg 68 | ... 69 | img_899.jpg 70 | test/ 71 | img_900.jpg 72 | ... 73 | img_999.jpg 74 | ``` 75 | 76 | #### Build it in a reproducible way 77 | 78 | Often a dataset will come either in one big set that you will split into train, dev and test. Academic datasets often come already with a train/test split (to be able to compare different models on a common test set). You will therefore have to build yourself the train/dev split before beginning your project. 79 | 80 | A good practice that is true for every software, but especially in machine learning, is to make every step of your project reproducible. 81 | It should be possible to start the project again from scratch and create the same exact split between train, dev and test sets. 82 | 83 | The cleanest way to do it is to have a `build_dataset.py` file that will be called once at the start of the project and will create the split into train, dev and test. Optionally, calling `build_dataset.py` can also download the dataset. 84 | We need to make sure that any randomness involved in `build_dataset.py` uses a __fixed seed__ so that every call to `python build_dataset.py` will result in the same output. 85 | 86 | >Never do the split manually (by moving files into different folders one by one), because you wouldn't be able to reproduce it. 87 | 88 | _An example `build_dataset.py` file is the one used [here][build-dataset] in the vision example project._ 89 | 90 | --- 91 | ## Details of implementation 92 | 93 | Let's illustrate the good practices with a simple example. We have filenames of images that we want to split into train, dev and test. 94 | Here is a way to split the data into three sets: 80% train, 10% dev and 10% test. 95 | ```python 96 | filenames = ['img_000.jpg', 'img_001.jpg', ...] 
97 | 98 | split_1 = int(0.8 * len(filenames)) 99 | split_2 = int(0.9 * len(filenames)) 100 | train_filenames = filenames[:split_1] 101 | dev_filenames = filenames[split_1:split_2] 102 | test_filenames = filenames[split_2:] 103 | ``` 104 | 105 | #### Ensure that train, dev and test have the same distribution if possible 106 | 107 | Often we have a big dataset and want to split it into train, dev and test set. In most cases, each split will have the same distribution as the others. 108 | 109 | __What could go wrong?__ Suppose that the first 100 images (`img_000.jpg` to `img_099.jpg`) have label 0, the 100 following label 1, ... and the last 100 images have label 9. Then the above code will make the dev set only have label 8, and the test set only label 9. 110 | 111 | We therefore need to ensure that the filenames are correctly shuffled before splitting the data. 112 | ```python 113 | filenames = ['img_000.jpg', 'img_001.jpg', ...] 114 | random.shuffle(filenames) # randomly shuffles the ordering of filenames 115 | 116 | split_1 = int(0.8 * len(filenames)) 117 | split_2 = int(0.9 * len(filenames)) 118 | train_filenames = filenames[:split_1] 119 | dev_filenames = filenames[split_1:split_2] 120 | test_filenames = filenames[split_2:] 121 | ``` 122 | 123 | This should give approximately the same distribution for train, dev and test sets. If necessary, it is also possible to split each class into 80%/10%/10% so that the distribution is the same in each set. 124 | 125 | 126 | #### Make it reproducible 127 | 128 | We talked earlier about making the script reproducible. 129 | Here we need to make sure that the train/dev/test split stays the same across every run of `python build_dataset.py`. 130 | 131 | The code above doesn't ensure reproducibility, since each time you run it you will have a different split. 132 | >To make sure to have the same split each time this code is run, we need to fix the random seed before shuffling the filenames: 133 | 134 | Here is a good way to remove any randomness in the process: 135 | ```python 136 | filenames = ['img_000.jpg', 'img_001.jpg', ...] 137 | filenames.sort() # make sure that the filenames have a fixed order before shuffling 138 | random.seed(230) 139 | random.shuffle(filenames) # shuffles the ordering of filenames (deterministic given the chosen seed) 140 | 141 | split_1 = int(0.8 * len(filenames)) 142 | split_2 = int(0.9 * len(filenames)) 143 | train_filenames = filenames[:split_1] 144 | dev_filenames = filenames[split_1:split_2] 145 | test_filenames = filenames[split_2:] 146 | ``` 147 | 148 | The call to `filenames.sort()` makes sure that if you build `filenames` in a different way, the output is still the same. 
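Putting these pieces together, a minimal `build_dataset.py` could look like the sketch below. This is only an illustration: the input directory `data/raw`, the `.jpg` filter and the output layout are assumptions made for this example, and the real `build_dataset.py` linked earlier does additional work that is specific to the SIGNS dataset.

```python
import os
import random
import shutil


def build_dataset(data_dir="data/raw", output_dir="data", seed=230):
    """Split the images in data_dir into data/train, data/dev and data/test (80/10/10)."""
    filenames = [f for f in os.listdir(data_dir) if f.endswith(".jpg")]
    filenames.sort()           # fixed order before shuffling
    random.seed(seed)          # fixed seed, so every run produces the same split
    random.shuffle(filenames)  # deterministic shuffle given the seed

    split_1 = int(0.8 * len(filenames))
    split_2 = int(0.9 * len(filenames))
    splits = {"train": filenames[:split_1],
              "dev": filenames[split_1:split_2],
              "test": filenames[split_2:]}

    for split_name, split_filenames in splits.items():
        split_dir = os.path.join(output_dir, split_name)
        os.makedirs(split_dir, exist_ok=True)
        for filename in split_filenames:
            shutil.copy(os.path.join(data_dir, filename), split_dir)


if __name__ == "__main__":
    build_dataset()
```

If the label distribution matters, the same logic can be applied per class (split each class 80%/10%/10% separately) so that train, dev and test end up with the same label proportions.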
149 | 150 | 151 | ### References 152 | - [course content][coursera] 153 | - [CS230 code examples][github] 154 | 155 | 156 | [main]: https://cs230-stanford.github.io/ 157 | [coursera]: https://www.coursera.org/learn/machine-learning-projects 158 | [github]: https://github.com/cs230-stanford/cs230-starter-code 159 | 160 | [build-dataset]: https://github.com/cs230-stanford/cs230-code-examples/blob/master/tensorflow/vision/build_dataset.py 161 | -------------------------------------------------------------------------------- /_posts/2018-02-01-logging-hyperparams.md: -------------------------------------------------------------------------------- 1 | --- 2 | layout: post 3 | title: "Logging and Hyperparameters" 4 | description: "Best practices to log, load hyperparameters and do random search" 5 | excerpt: "Best practices to log, load hyperparameters and do random search" 6 | author: "Teaching assistants Guillaume Genthial and Olivier Moindrot" 7 | date: 2018-01-24 8 | mathjax: true 9 | published: true 10 | tags: best-practice 11 | github: https://github.com/cs230-stanford/cs230-code-examples 12 | module: Tutorials 13 | --- 14 | 15 | Logging your outputs to a file is a general good practice in any project. 16 | An even more important good practice is to handle correctly the multiple hyperparameters that arise in any deep learning project. We need to be able to store them in a file and know the full set of hyperparameters used in any past experiment. 17 | 18 | This tutorial is among a series of tutorials explaining how to structure a deep learning project. 19 | Please see the full list of posts on the [main page][main]. 20 | 21 | 22 | __Table of Content__ 23 | 24 | * TOC 25 | {:toc} 26 | 27 | ## Logging 28 | 29 | A common problem when building a project is to forget about logging. In other words, as long as you write stuff in files and print things to the shell, people assume they're going to be fine. A better practice is to write __everything__ that you print to the terminal in a `log` file. 30 | 31 | That's why in `train.py` and `evaluate.py` we initialize a `logger` using the built-in `logging` package with: 32 | 33 | ```python 34 | # Set the logger to write the logs of the training in train.log 35 | set_logger(os.path.join(args.model_dir, 'train.log')) 36 | ``` 37 | The `set_logger` function is defined in `utils.py`. 38 | 39 | For instance during training this line of code will create a `train.log` file in `experiments/base_model/`. 40 | You don't have to worry too much about how we set it. 41 | Whenever you want to print somehting, use `logging.info` instead of the usual `print`: 42 | 43 | ```python 44 | logging.info("It will be printed both to the Terminal and written in the .log file") 45 | ``` 46 | 47 | 48 | That way, you'll be able to both see it in the Terminal and remember it in the future when you'll need to read the `train.log` file. 49 | 50 | 51 | ## Loading hyperparameters from a configuration file 52 | 53 | You'll quickly realize when doing a final project or any research project that you'll need a way to specify some parameters to your model. You have different sorts of hyperparameters (not all of them are necessary): 54 | - hyperparameters for the model: number of layers, number of neurons per layer, activation functions, dropout rate... 55 | - hyperparameters for the training: number of epochs, learning rate, ... 56 | - dataset choices: size of the dataset, size of the vocabulary for text, ... 57 | - checkpoints: when to save the model, when to log to plot the loss, ... 
58 | 59 | 60 | There are multiple ways to load the hyperparameters: 61 | 62 | 1. Use the `argparse` module as we do to specify the `data_dir`: 63 | ```python 64 | parser.add_argument('--data_dir', default='data/', help="Directory containing the dataset") 65 | ``` 66 | When experimenting, you need to try multiple combinations of hyperparameters. This quickly becomes unmanageable because you cannot keep track of the hyperparameters you are testing. Plus, how do you even keep track of the parameters if you want to go back to a previous experiment? 67 | 68 | 2. Hard-code the values of your hyperparameters in a new `params.py` file and import it at the beginning of your `train.py` file, for instance, to get these hyperparameters. Again, you'll need to find a way to save your config, and this is not very clean. 69 | 70 | 3. Write all your parameters in a file (we used `.json` but it could be anything else) and store this file in the directory containing your experiment. 71 | If you need to go back to your experiment later, you can quickly review which hyperparameters yielded which performance. 72 | 73 | We chose to take this third approach in our code. We define a class `Params` in `utils.py`. Note that, to be consistent with the deep learning programming frameworks we use, we refer to hyperparameters as `params` in the code. 74 | 75 | Loading the hyperparameters is as simple as writing 76 | 77 | ```python 78 | params = Params("experiments/base_model/params.json") 79 | ``` 80 | 81 | 82 | and if your `params.json` file looks like 83 | 84 | ```json 85 | { 86 | "model_version": "baseline", 87 | 88 | "learning_rate": 1e-3, 89 | "batch_size": 32, 90 | "num_epochs": 10 91 | } 92 | ``` 93 | 94 | 95 | you'll be able to access the different entries with 96 | 97 | ```python 98 | params.model_version 99 | ``` 100 | > In your code, once your params object is initialized, you can update it with another `.json` file using the `params.update("other_params.json")` method. 101 | 102 | Later in your code, for example when you define your model, you can then do something like 103 | 104 | ```python 105 | if params.model_version == "baseline": 106 | logits = build_model_baseline(inputs, params) 107 | elif params.model_version == "simple_convolutions": 108 | logits = build_model_simple_convolutions(inputs, params) 109 | ``` 110 | 111 | which is quite handy for having different functions and behaviors depending on a set of hyperparameters! 112 | 113 | 114 | ## Hyperparameter search 115 | 116 | An important part of any machine learning project is hyperparameter tuning; please refer to the Coursera Deep Learning Specialization ([#2][course2] and [#3][course3]) for more detailed information. In other words, you want to see how your model performs on the development set for different sets of hyperparameters. There are basically two ways to implement this: 117 | 118 | 1. Have a Python loop over the different sets of hyperparameters and, at each iteration of the loop, run the `train_and_evaluate(model_spec, params, ...)` function, like 119 | ```python 120 | for lr in [0.1, 0.01, 0.001]: 121 | params.learning_rate = lr 122 | train_and_evaluate(model_spec, params, ...) 123 | ``` 124 | 125 | 2. Have a more general script that creates a subfolder for each set of hyperparameters and launches a training job using the `python train.py` command.
While there is not much difference in the simplest setting, some more advanced clusters have some job managers and instead of running multiple `python train.py`, they instead do something like `job-manager-submit train.py` which will run the jobs concurrently, making the hyperparameter tuning much faster ! 126 | ```python 127 | for lr in [0.1, 0.01, 0.001]: 128 | params.learning_rate = lr 129 | # Create new experiment directory and save the relevant params.json 130 | subfolder = create_subfolder("lr_{}".format(lr)) 131 | export_params_to_json(params, subfolder) 132 | # Launch a training in this directory -- it will call `train.py` 133 | lauch_training_job(model_dir=subfolder, ...) 134 | ``` 135 | 136 | This is what the `search_hyperparams.py` file does. It is basically a python script that runs other python scripts. Once all the sub-jobs have ended, you'll have the results of each experiment in a `metrics_eval_best_weights.json` file for each experiment directory. 137 | 138 | ``` 139 | learning_rate/ 140 | hyperparams.json 141 | learning_rate_0.1/ 142 | hyperparams.json 143 | metrics_eval_best_weights.json 144 | learning_rate_0.01/ 145 | hyperparams.json 146 | metrics_eval_best_weights.json 147 | ``` 148 | 149 | 150 | and by running `python synthesize_results.py --model_dir experiments/learning_rate` you'll be able to gather the different metrics achieved for the different sets of hyperparameters ! 151 | 152 | From one experiment to another, it is very important to test hyperparameters one at a time. Comparing the dev-set performance of two models "A" and "B" which have a totally different set of hyperparameters will probably lead to wrong decisions. You need to vary only ONE hyperparameter (let's say the learning rate) when comparing models "A" and "B". Then, you can see the impact of this change on the dev-set performance. 153 | 154 | 155 | [main]: https://cs230-stanford.github.io 156 | [github]: https://github.com/cs230-stanford/cs230-code-examples 157 | [post-1]: https://cs230-stanford.github.io/project-code-examples.html 158 | [tf-post]: https://cs230-stanford.github.io/tensorflow-psp.html 159 | [tf-data]: https://cs230-stanford.github.io/tensorflow-input-data.html 160 | [course2]: https://www.coursera.org/learn/deep-neural-network 161 | [course3]: https://www.coursera.org/learn/machine-learning-projects 162 | -------------------------------------------------------------------------------- /_posts/2018-02-01-tensorflow-getting-started.md: -------------------------------------------------------------------------------- 1 | --- 2 | layout: post 3 | title: "Introduction to Tensorflow" 4 | description: "Graph, Session, Nodes and variable scope" 5 | excerpt: "Graph, Session, Nodes and variable scope" 6 | author: "Teaching assistants Guillaume Genthial and Olivier Moindrot" 7 | date: 2018-01-24 8 | mathjax: true 9 | published: true 10 | tags: tensorflow 11 | github: https://github.com/cs230-stanford/cs230-code-examples/tree/master/tensorflow 12 | module: Tutorials 13 | --- 14 | 15 | This post follows the [main post][post-1] announcing the release of the CS230 code examples. 16 | We will explain here the TensorFlow part of the code, in our [github repository][github]. 
17 | 18 | ``` 19 | tensorflow/ 20 | vision/ 21 | nlp/ 22 | ``` 23 | 24 | This tutorial is among a series explaining how to structure a deep learning project: 25 | - [first post][post-1]: installation, get started with the code for the projects 26 | - __this post: (TensorFlow) explain the global structure of the code__ 27 | - [third post][post-3]: (TensorFlow) how to build the data pipeline 28 | - [fourth post][post-4]: (Tensorflow) how to build the model and train it 29 | 30 | 31 | __Goals of this tutorial__ 32 | - learn more about TensorFlow 33 | - learn an example of how to correctly structure a deep learning project in TensorFlow 34 | - fully understand the code to be able to use it for your own projects 35 | 36 | __Table of Content__ 37 | 38 | * TOC 39 | {:toc} 40 | 41 | --- 42 | 43 | ### Resources 44 | 45 | For an official __introduction__ to the Tensorflow concepts of `Graph()` and `Session()`, check out the [official introduction on tensorflow.org](https://www.tensorflow.org/get_started/get_started#tensorflow_core_tutorial). 46 | 47 | For a __simple example on MNIST__, read [the official tutorial](https://www.tensorflow.org/get_started/mnist/beginners), but keep in mind that some of the techniques are not recommended for big projects (they use `placeholders` instead of the new `tf.data` pipeline, they don't use `tf.layers`, etc.). 48 | 49 | For a more __detailed tour__ of Tensorflow, reading the [programmer's guide](https://www.tensorflow.org/programmers_guide/) is definitely worth the time. You'll learn more about Tensors, Variables, Graphs and Sessions, as well as the saving mechanism or how to import data. 50 | 51 | For a __more advanced use__ with concrete examples and code, we recommend reading [the relevant tutorials](https://www.tensorflow.org/tutorials/) for your project. You'll find good code and explanations, going from [sequence-to-sequence in Tensorflow](https://www.tensorflow.org/tutorials/seq2seq) to an [introduction to TF layers for convolutionnal Neural Nets](https://www.tensorflow.org/tutorials/layers#getting_started). 52 | 53 | You might also be interested in [Stanford's CS20 class: Tensorflow for Deep Learning Research](http://web.stanford.edu/class/cs20si/) and its [github repo](https://github.com/chiphuyen/stanford-tensorflow-tutorials) containing some cool examples. 54 | 55 | ### Structure of the code 56 | 57 | The code for each Tensorflow example shares a common structure: 58 | ``` 59 | data/ 60 | experiments/ 61 | model/ 62 | input_fn.py 63 | model_fn.py 64 | utils.py 65 | training.py 66 | evaluation.py 67 | train.py 68 | search_hyperparams.py 69 | synthesize_results.py 70 | evaluate.py 71 | ``` 72 | 73 | Here is each `model/` file purpose: 74 | - `model/input_fn.py`: where you define the input data pipeline 75 | - `model/model_fn.py`: creates the deep learning model 76 | - `model/utils.py`: utility functions for handling hyperparams / logging 77 | - `model/training.py`: utility functions to train a model 78 | - `model/evaluation.py`: utility functions to evaluate a model 79 | 80 | We recommend reading through `train.py` to get a high-level overview. 81 | 82 | Once you get the high-level idea, depending on your task and dataset, you might want to modify 83 | - `model/model_fn.py` to change the model's architecture, i.e. how you transform your input into your prediction as well as your loss, etc. 84 | - `model/input_fn` to change the process of feeding data to the model. 
85 | - `train.py` and `evaluate.py` to change the story-line (maybe you need to change the filenames, load a vocabulary, etc.) 86 | 87 | Once you get something working for your dataset, feel free to edit any part of the code to suit your own needs. 88 | 89 | 90 | ### Graph, Session and nodes 91 | 92 | When designing a Model in Tensorflow, there are [basically 2 steps](https://www.tensorflow.org/get_started/get_started#tensorflow_core_tutorial) 93 | 1. building the computational graph, the nodes and operations and how they are connected to each other 94 | 2. evaluating / running this graph on some data 95 | 96 | As an example of __step 1__, if we define a TF constant (=a graph node), when we print it, we get a *Tensor* object (= a node) and not its value 97 | 98 | ```python 99 | x = tf.constant(1., dtype=tf.float32, name="my-node-x") 100 | print(x) 101 | > Tensor("my-node-x:0", shape=(), dtype=float32) 102 | ``` 103 | 104 | Now, let's get to __step 2__, and evaluate this node. We'll need to create a `tf.Session` that will take care of actually evaluating the graph 105 | 106 | ```python 107 | with tf.Session() as sess: 108 | print(sess.run(x)) 109 | > 1.0 110 | ``` 111 | 112 | 113 | In the code examples, 114 | 115 | - __step 1__ `model/input_fn.py` and `model/model_fn` 116 | 117 | - __step 2__ `model/training.py` and `model/evaluation.py` 118 | 119 | ### A word about [variable scopes](https://www.tensorflow.org/versions/r0.12/how_tos/variable_scope/#the_problem) 120 | 121 | When creating a node, Tensorflow will have a name for it. You can add a prefix to the nodes names. This is done with the `variable_scope` mechanism 122 | 123 | ```python 124 | with tf.variable_scope('model'): 125 | x1 = tf.get_variable('x', [], dtype=tf.float32) # get or create variable with name 'model/x:0' 126 | print(x1) 127 | > 128 | ``` 129 | 130 | > What happens if I instantiate `x` twice ? 131 | 132 | ```python 133 | with tf.variable_scope('model'): 134 | x2 = tf.get_variable('x', [], dtype=tf.float32) 135 | > ValueError: Variable model/x already exists, disallowed. 136 | ``` 137 | 138 | When trying to create a new variable named `model/x`, we run into an Exception as a variable with the same name already exists. Thanks to this naming mechanism, you can actually control which value you give to the different nodes, and at different points of your code, decide to have 2 python objects correspond to the same node ! 139 | 140 | ```python 141 | with tf.variable_scope('model', reuse=True): 142 | x2 = tf.get_variable('x', [], dtype=tf.float32) 143 | print(x2) 144 | > 145 | ``` 146 | 147 | We can check that they indeed have the same value 148 | ```python 149 | with tf.Session() as sess: 150 | sess.run(tf.global_variables_initializer()) # Initialize the Variables 151 | sess.run(tf.assign(x1, tf.constant(1.))) # Change the value of x1 152 | sess.run(tf.assign(x2, tf.constant(2.))) # Change the value of x2 153 | print("x1 = ", sess.run(x1), " x2 = ", sess.run(x2)) 154 | 155 | > x1 = 2.0 x2 = 2.0 156 | ``` 157 | 158 | 159 | ### How we deal with different Training / Evaluation Graphs 160 | 161 | Code examples design choice: theoretically, the graphs you define for training and inference can be different, but they still need to share their weights. To remedy this issue, there are two possibilities 162 | 163 | 1. re-build the graph, create a new session and reload the weights from some file when we switch between training and inference. 164 | 2. 
create all the nodes for training and inference in the graph and make sure that the Python code does not create the nodes twice by using the `reuse=True` trick explained above. 165 | 166 | We decided to go for the second option. As you'll notice in `train.py`, we give an extra argument when we build our graphs: 167 | 168 | ```python 169 | train_model_spec = model_fn('train', train_inputs, params) 170 | eval_model_spec = model_fn('eval', eval_inputs, params, reuse=True) 171 | ``` 172 | 173 | When we create the graph for the evaluation (`eval_model_spec`), the `model_fn` will encapsulate all the nodes in a `tf.variable_scope("model", reuse=True)` so that the nodes that have the same names as in the training graph share their weights! 174 | 175 | For those interested in the problem of making training and eval graphs coexist, you can read this [discussion](https://www.tensorflow.org/tutorials/seq2seq#building_training_eval_and_inference_graphs), which advocates for the other option. 176 | 177 | > As a side note, option 1 is also the one used in [`tf.Estimator`](https://www.tensorflow.org/get_started/estimator). 178 | 179 |
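To make the `reuse` argument more concrete, here is a simplified sketch of how a `model_fn` can wrap everything in a variable scope. It is only an illustration: the real `model/model_fn.py` also builds the loss, the metrics and the training operation, and `build_model` stands in for whatever function actually creates the layers.

```python
import tensorflow as tf

def model_fn(mode, inputs, params, reuse=False):
    """Sketch of a model_fn that shares weights between the train and eval graphs."""
    is_training = (mode == 'train')

    with tf.variable_scope('model', reuse=reuse):
        # With reuse=True, tf.get_variable returns the variables already created
        # by the training graph instead of creating new ones.
        logits = build_model(is_training, inputs, params)

    model_spec = dict(inputs)
    model_spec['logits'] = logits
    return model_spec
```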
    184 | 185 | Now, let's see how we can input data to our model. 186 | 187 |

> Next: [Building the input data pipeline][post-3]

    188 | 189 | 190 | 191 | [github]: https://github.com/cs230-stanford/cs230-code-examples/tree/master/tensorflow 192 | [post-1]: https://cs230-stanford.github.io/project-code-examples.html 193 | [post-3]: https://cs230-stanford.github.io/tensorflow-input-data.html 194 | [post-4]: https://cs230-stanford.github.io/tensorflow-model.html 195 | [tf-post]: https://cs230-stanford.github.io/tensorflow-psp.html 196 | [tf-data]: https://cs230-stanford.github.io/tensorflow-input-data.html 197 | -------------------------------------------------------------------------------- /_posts/2018-06-02-session-4.md: -------------------------------------------------------------------------------- 1 | --- 2 | layout: post 3 | title: "Data preprocessing" 4 | description: "" 5 | excerpt: "Session 4" 6 | author: "Kian Katanforoosh, Andrew Ng" 7 | date: 2018-06-02 8 | mathjax: true 9 | published: true 10 | tags: tensorflow 11 | github: https://github.com/cs230-stanford/cs230-code-examples/tree/master/tensorflow 12 | module: Tutorials 13 | --- 14 | 15 | ## Part I - Image data preprocessing 16 | 17 | In this part, you will use the popular package “skimage” to preprocess and augment an image before sending it to a neural network coded in Keras. 18 | 19 | ```python 20 | import numpy as np 21 | import pandas as pd 22 | from keras.models import Sequential 23 | from keras import optimizers 24 | from keras.utils import np_utils 25 | from keras.models import Sequential 26 | from keras.layers import Dense, Conv2D, Embedding, Activation, MaxPooling2D, Dropout 27 | from keras.layers import Flatten, LSTM, ZeroPadding2D, BatchNormalization, MaxPooling2D 28 | 29 | %matplotlib inline 30 | import matplotlib.pyplot as plt 31 | ``` 32 | 33 | **Question 1:** Use skimage to load your “iguana.jpg” and display it in your notebook. 34 | 35 | ```python 36 | from skimage.measure import compare_ssim as ssim 37 | from skimage import io 38 | from skimage.transform import resize 39 | 40 | # Loading the image 41 | ### START CODE HERE ### 42 | 43 | ### END CODE HERE ### 44 | ``` 45 | 46 | **Question 2:** Use skimage to zoom on the face of the iguana. Display the image. 47 | 48 | ```python 49 | # Zoom image 50 | ### START CODE HERE ### 51 | 52 | ### END CODE HERE ### 53 | ``` 54 | 55 | **Question 3:** Use skimage to rescale the image to 20% of the initial size of the image. Display the image. Rescaling means lowering the resolution of the image. Remember that in class we talked about finding the computation/accuracy trade-off by showing different resolutions of the same image to humans and figuring out what is the minimum resolution leading to the maximum human accuracy. 56 | 57 | ```python 58 | # Rescale image to 25% of the initial size 59 | ### START CODE HERE ### 60 | 61 | ### END CODE HERE ### 62 | ``` 63 | 64 | **Question 4:** Use skimage to add random noise to the image. Display the image. 65 | 66 | ```python 67 | # Add random noise 68 | ### START CODE HERE ### 69 | 70 | ### END CODE HERE ### 71 | ``` 72 | 73 | **Question 5:** Use skimage to rotate the image by 45 degrees. 74 | 75 | ```python 76 | # Rotate 77 | ### START CODE HERE ### 78 | 79 | ### END CODE HERE ### 80 | ``` 81 | 82 | **Question 6:** Use skimage to flip the image horizontaly and verticaly. Display the image. 
83 | 84 | ```python 85 | # Horizontal flip 86 | ### START CODE HERE ### 87 | 88 | ### END CODE HERE ### 89 | ``` 90 | 91 | ```python 92 | # Vertical flip 93 | ### START CODE HERE ### 94 | 95 | ### END CODE HERE ### 96 | ``` 97 | 98 | **Question 7:** (Optional) Use skimage to (i) blur the image, (ii) enhance its contrast, (iii) convert to grayscale, (iv) invert colors… 99 | 100 | ```python 101 | # Blur image 102 | ### START CODE HERE ### 103 | 104 | ### END CODE HERE ### 105 | 106 | # Convert to grayscale 107 | ### START CODE HERE ### 108 | 109 | ### END CODE HERE ### 110 | 111 | # Enhance contrast 112 | ### START CODE HERE ### 113 | 114 | ### END CODE HERE ### 115 | 116 | # Color inversion 117 | ### START CODE HERE ### 118 | 119 | ### END CODE HERE ### 120 | ``` 121 | 122 | 123 | Skimage is a popular package for customized data preprocessing and augmentation. However, deep learning frameworks such as Keras often incorporate functions to help you preprocess data in a few lines of code. 124 | 125 | **Question 8:** Read and run the Keras code for image preprocessing. It will save augmented images in a folder called “preview” on the notebook’s directory. 126 | 127 | # Image preprocessing in Keras 128 | 129 | ```python 130 | # Image preprocessing in Keras 131 | 132 | from keras.preprocessing.image import ImageDataGenerator, array_to_img, img_to_array, load_img 133 | 134 | datagen = ImageDataGenerator( 135 | rotation_range=45, 136 | width_shift_range=0.3, 137 | height_shift_range=0.3, 138 | shear_range=0.3, 139 | zoom_range=0.3, 140 | horizontal_flip=True, 141 | fill_mode='nearest') 142 | 143 | img = load_img('iguana.jpg') # this is a PIL image 144 | x = img_to_array(img) # convert image to numpy array 145 | x = x.reshape((1,) + x.shape) # reshape image to (1, ..,..,..) to fit keras' standard shape 146 | 147 | # Use flow() to apply data augmentation randomly according to the datagenerator 148 | # and saves the results to the `preview/` directory 149 | num_image_generated = 0 150 | for batch in datagen.flow(x, batch_size=1, save_to_dir='preview', save_prefix='cat', save_format='jpeg'): 151 | num_image_generated += 1 152 | if num_image_generated > 20: 153 | break # stop the loop after num_image_generated iterations 154 | ``` 155 | 156 | **Question 9:** (Optional) Train the CNN coded for you in the notebook (See Appendix below) on any of the pictures you created. Evaluate the model. 157 | 158 | ## Part II - Text data preprocessing 159 | 160 | **Question 1:** Go on any static website online. Click right and select “View Page Source”. Copy a complicated part of the html code. Paste it in the notebook in the variable “html_page”. 161 | 162 | ```python 163 | ### START CODE HERE ### 164 | html_txt = """ """ 165 | ### END CODE HERE ### 166 | 167 | print(html_txt) 168 | ``` 169 | 170 | **Question 2:** Use *BeautifulSoup* to parse the html_txt. Print the html_txt. 171 | 172 | ```python 173 | from bs4 import BeautifulSoup 174 | 175 | # Parse the html input 176 | ### START CODE HERE ### 177 | 178 | ### END CODE HERE ### 179 | 180 | print(html_txt) 181 | ``` 182 | 183 | **Question 3:** Use *re* to remove meta-characters such as squared brackets and anything between them. Print the html_txt. 184 | 185 | ```python 186 | import re, string, unicodedata 187 | # Remove meta characters and things between them. 
188 | ### START CODE HERE ### 189 | 190 | ### END CODE HERE ### 191 | 192 | print(html_txt) 193 | ``` 194 | 195 | **Question 4:** Using the Natural Language ToolKit (nltk), separate the text into a list of words. 196 | 197 | ```python 198 | import nltk 199 | from nltk import word_tokenize, sent_tokenize 200 | 201 | # Separate text into words 202 | ### START CODE HERE ### 203 | 204 | ### END CODE HERE ### 205 | ``` 206 | 207 | **Question 5:** (Optional) Remove non ASCII characters. Convert to Lower case. Remove punctuation, stopwords, … 208 | 209 | ```python 210 | ### START CODE HERE ### 211 | 212 | ### END CODE HERE ### 213 | ``` 214 | 215 | A machine will not be able to read this list strings, you need to build a vocabulary and tokenize your words. 216 | 217 | **Question 6:** Build the vocabulary from the list of words. 218 | 219 | ```python 220 | # Build Vocabulary 221 | ### START CODE HERE ### 222 | 223 | ### END CODE HERE ### 224 | ``` 225 | 226 | **Question 7**: Build word to integer mapping in Python. It should be sorted. 227 | 228 | ``` 229 | # Build word to integer mapping in Python. It should be sorted. 230 | ### START CODE HERE ### 231 | 232 | ### END CODE HERE ### 233 | ``` 234 | 235 | **Question 8**: Tokenize your text. 236 | 237 | ```python 238 | # Convert list of words into list of tokens using this mapping 239 | ### START CODE HERE ### 240 | 241 | ### END CODE HERE ### 242 | ``` 243 | 244 | **Question 9**: Read and run the Keras code for text preprocessing. It uses the Tokenizer Function. 245 | 246 | ```python 247 | # Preprocess text with Keras for Sentiment classification 248 | from keras.preprocessing.text import Tokenizer 249 | from keras.preprocessing.sequence import pad_sequences 250 | 251 | examples = ['You are amazing!','It is so bad','Congratulations','You suck bro','Awesome dude!'] 252 | Y = [1, 0, 1, 0, 1] 253 | 254 | # Define Tokenizer 255 | t = Tokenizer() 256 | # Fit Tokenizer on text (Build vocab etc..) 257 | t.fit_on_texts(examples) 258 | # Convert texts to sequences of integers 259 | X = t.texts_to_sequences(examples) 260 | # Pad sequences of integers 261 | X = pad_sequences(X, padding = 'post') 262 | 263 | # Get the vocabulary size, useful for the embedding layer. 264 | vocab_size = len(t.word_index) + 1 265 | print(vocab_size) 266 | print(X) 267 | ``` 268 | 269 | **Question 10**: (Optional) Train the RNN coded for you in the notebook on the sentiment classification class (with 5 examples). Evaluate the mode. 
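To recap Part II, here is one possible sketch of the preprocessing pipeline described in Questions 1-8: parse the HTML, clean it with `re`, split it into words with nltk, then build a sorted vocabulary and a word-to-integer mapping. It uses a toy `html_txt` and is only an illustration, not the official solution to the questions above; you may also need to run `nltk.download('punkt')` once before `word_tokenize` works.

```python
import re

from bs4 import BeautifulSoup
from nltk import word_tokenize

# Toy input standing in for the page source you pasted in the notebook
html_txt = """<html><body><p>John lives in [1] New York.</p></body></html>"""

# Parse the HTML and keep only the visible text
text = BeautifulSoup(html_txt, "html.parser").get_text()

# Remove squared brackets and anything between them
text = re.sub(r"\[[^\]]*\]", "", text)

# Separate the text into a list of words
words = word_tokenize(text.lower())

# Build a sorted vocabulary and a word-to-integer mapping
vocab = sorted(set(words))
word_to_idx = {word: i for i, word in enumerate(vocab)}

# Tokenize: convert the list of words into a list of integers
tokens = [word_to_idx[word] for word in words]
print(tokens)
```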
270 | 271 | ## Appendix: Models and training codes 272 | 273 | ```python 274 | # CNN 275 | model_CNN = Sequential() 276 | model_CNN.add(Conv2D(32, (7, 7), strides = (1, 1), name = 'conv0', input_shape = image.shape)) 277 | model_CNN.add(BatchNormalization(axis = 3, name = 'bn0')) 278 | model_CNN.add(Activation('relu')) 279 | model_CNN.add(MaxPooling2D((2, 2), name='max_pool')) 280 | model_CNN.add(Flatten()) 281 | model_CNN.add(Dense(1, activation='sigmoid', name='fc')) 282 | ``` 283 | 284 | ```python 285 | # RNN 286 | model_RNN = Sequential() 287 | model_RNN.add(Embedding(vocab_size, 128)) 288 | model_RNN.add(LSTM(128)) 289 | model_RNN.add(Dense(1, activation='sigmoid')) 290 | ``` 291 | 292 | ```python 293 | # training code for CNN 294 | sgd = optimizers.SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True) 295 | model_CNN.compile(loss='binary_crossentropy', optimizer=sgd, metrics=['accuracy']) 296 | model_CNN.fit(np.expand_dims(image, axis=0), np.array([1]), epochs=2) 297 | ``` 298 | 299 | ``` 300 | # training code for RNN 301 | sgd = optimizers.SGD(lr=0.01, decay=1e-6, momentum=0.9, nesterov=True) 302 | model_RNN.compile(loss='binary_crossentropy', optimizer=sgd, metrics=['accuracy']) 303 | model_RNN.fit(np.array(X), np.array(Y), epochs=1000) 304 | ``` 305 | 306 | ``` 307 | # testing code for CNN 308 | model_CNN.predict(np.expand_dims(image_blured, axis=0)) 309 | ``` 310 | 311 | -------------------------------------------------------------------------------- /_posts/2018-02-01-pytorch-vision.md: -------------------------------------------------------------------------------- 1 | --- 2 | layout: post 3 | title: "Classifying Images of Hand Signs" 4 | description: "Defining a Convolutional Network and Loading Image Signs" 5 | excerpt: "Defining a Convolutional Network and Loading Image Data" 6 | author: "Teaching assistants Surag Nair, Guillaume Genthial and Olivier Moindrot" 7 | date: 2018-01-31 8 | mathjax: true 9 | published: true 10 | tags: pytorch vision 11 | github: https://github.com/cs230-stanford/cs230-code-examples/tree/master/pytorch/vision 12 | module: Tutorials 13 | --- 14 | 15 | 16 | 17 | This post follows the [main post][post-1] announcing the CS230 Project Code Examples and the [PyTorch Introduction][pt-start]. In this post, we go through an example from Computer Vision, in which we learn how to load images of hand signs and classify them. 18 | 19 | 20 | This tutorial is among a series explaining the code examples: 21 | 22 | - [getting started][post-1]: installation, getting started with the code for the projects 23 | - [PyTorch Introduction][pt-start]: global structure of the PyTorch code examples 24 | - **this post**: predicting labels from images of hand signs 25 | - [NLP][pt-nlp]: Named Entity Recognition (NER) tagging for sentences 26 | 27 | __Goals of this tutorial__ 28 | - learn how to use PyTorch to load image data efficiently 29 | - specify a convolutional neural network 30 | - understand the key aspects of the code well-enough to modify it to suit your needs 31 | 32 | __Table of Contents__ 33 | 34 | * TOC 35 | {:toc} 36 | 37 |
     
    38 | 39 | --- 40 |
     
    41 | 42 | ### Problem Setup 43 | 44 | We use images from deeplearning.ai's SIGNS dataset that you have used in one of [Course 2][course2]'s programming assignment. Each image from this dataset is a picture of a hand making a sign that represents a number between 1 and 6. It is 1080 training images and 120 test images. In our example, we use images scaled down to size `64x64`. 45 | 46 | 47 | ### Making a PyTorch Dataset 48 | 49 | `torch.utils.data` provides some nifty functionality for loading data. We use `torch.utils.data.Dataset`, which is an abstract class representing a dataset. To make our own SIGNSDataset class, we need to inherit the `Dataset` class and override the following methods: 50 | - `__len__`: so that `len(dataset)` returns the size of the dataset 51 | - `__getitem__`: to support indexing using `dataset[i]` to get the ith image 52 | 53 | We then define our class as below: 54 | ```python 55 | from PIL import Image 56 | from torch.utils.data import Dataset, DataLoader 57 | 58 | class SIGNSDataset(Dataset): 59 | def __init__(self, data_dir, transform): 60 | # store filenames 61 | self.filenames = os.listdir(data_dir) 62 | self.filenames = [os.path.join(data_dir, f) for f in self.filenames] 63 | 64 | # the first character of the filename contains the label 65 | self.labels = [int(filename.split('/')[-1][0]) for filename in self.filenames] 66 | self.transform = transform 67 | 68 | def __len__(self): 69 | # return size of dataset 70 | return len(self.filenames) 71 | 72 | def __getitem__(self, idx): 73 | # open image, apply transforms and return with label 74 | image = Image.open(self.filenames[idx]) # PIL image 75 | image = self.transform(image) 76 | return image, self.labels[idx] 77 | ``` 78 | 79 | Notice that when we return an image-label pair using `__getitem__` we apply a `tranform` on the image. These transformations are a part of the `torchvision.transforms` [package](http://pytorch.org/docs/master/torchvision/transforms.html), that allow us to manipulate images easily. Consider the following composition of multiple transforms: 80 | 81 | ```python 82 | train_transformer = transforms.Compose([ 83 | transforms.Resize(64), # resize the image to 64x64 84 | transforms.RandomHorizontalFlip(), # randomly flip image horizontally 85 | transforms.ToTensor()]) # transform it into a PyTorch Tensor 86 | ``` 87 | 88 | When we apply `self.transform(image)` in `__getitem__`, we pass it through the above transformations before using it as a training example. The final output is a PyTorch Tensor. To augment the dataset during training, we also use the `RandomHorizontalFlip` transform when loading the image. We can specify a similar `eval_transformer` for evaluation without the random flip. To load a `Dataset` object for the different splits of our data, we simply use: 89 | 90 | ```python 91 | train_dataset = SIGNSDataset(train_data_path, train_transformer) 92 | val_dataset = SIGNSDataset(val_data_path, eval_transformer) 93 | test_dataset = SIGNSDataset(test_data_path, eval_transformer) 94 | ``` 95 | 96 | ### Loading Batches of Data 97 | 98 | `torch.utils.data.DataLoader` provides an iterator that takes in a `Dataset` object and performs batching, shuffling and loading of the data. This is crucial when images are big in size and take time to load. In such a case, the GPU can be left idling while the CPU fetches the images from file and then applies the transforms. 
In contrast, the DataLoader class (using multiprocessing) fetches the data asynchronously and prefetches batches to be sent to the GPU. Initialising the `DataLoader` is quite easy: 99 | 100 | ```python 101 | train_dataloader = DataLoader(SIGNSDataset(train_data_path, train_transformer), 102 | batch_size=hyperparams.batch_size, shuffle=True, 103 | num_workers=hyperparams.num_workers) 104 | ``` 105 | 106 | We can then iterate through batches of examples as follows: 107 | ```python 108 | for train_batch, labels_batch in train_dataloader: 109 | # wrap Tensors in Variables 110 | train_batch, labels_batch = Variable(train_batch), Variable(labels_batch) 111 | 112 | # pass through model, perform backpropagation and updates 113 | output_batch = model(train_batch) 114 | ... 115 | ``` 116 | 117 | Applying transformations on the data loads them as PyTorch Tensors. We wrap them in PyTorch Variables before passing them into the model. The `for` loop ends after one pass over the data, i.e. after one epoch. It can be reused again for another epoch without any changes. We can use similar data loaders for validation and test data. 118 | 119 | ### Convolutional Network Model 120 | 121 | Now that we have figured out how to load our images, let's have a look at the *pièce de résistance* - the CNN model. As mentioned in the [previous][pt-start] post, we first define the components of our model, followed by its functional form. Let's have a look at the `__init__` function for our model that takes in a `3x64x64` image: 122 | 123 | ```python 124 | import torch.nn as nn 125 | import torch.nn.functional as F 126 | 127 | class Net(nn.Module): 128 | def __init__(self): 129 | super(Net, self).__init__() # initialize the parent nn.Module, then define the convolutional layers 130 | self.conv1 = nn.Conv2d(in_channels = 3, out_channels = 32, kernel_size = 3, stride = 1, padding = 1) 131 | self.bn1 = nn.BatchNorm2d(32) 132 | self.conv2 = nn.Conv2d(in_channels = 32, out_channels = 64, kernel_size = 3, stride = 1, padding = 1) 133 | self.bn2 = nn.BatchNorm2d(64) 134 | self.conv3 = nn.Conv2d(in_channels = 64, out_channels = 128, kernel_size = 3, stride = 1, padding = 1) 135 | self.bn3 = nn.BatchNorm2d(128) 136 | 137 | # 2 fully connected layers to transform the output of the convolution layers to the final output 138 | self.fc1 = nn.Linear(in_features = 8*8*128, out_features = 128) 139 | self.fcbn1 = nn.BatchNorm1d(128) 140 | self.fc2 = nn.Linear(in_features = 128, out_features = 6) 141 | self.dropout_rate = hyperparams.dropout_rate 142 | ``` 143 | 144 | The first parameter to the convolutional filter `nn.Conv2d` is the number of input channels, the second is the number of output channels, and the third is the size of the square filter (`3x3` in this case). Similarly, the batch normalisation layer takes as input the number of channels for 2D images and the number of features in the 1D case. The fully connected `Linear` layers take the input and output dimensions. 145 | 146 | In this example, we explicitly specify each of the values. In order to make the initialisation of the model more flexible, you can pass in parameters such as image size to the `__init__` function and use that to specify the sizes. You must be very careful when specifying parameter dimensions, since mismatches will lead to errors in the forward propagation.
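Since dimension mismatches are a common source of bugs, it helps to check where the `8*8*128` passed to `fc1` comes from. The short sketch below just walks a `3x64x64` input through the three blocks, assuming (as in the forward pass shown next) that each `3x3` convolution with stride 1 and padding 1 preserves the spatial size and each `2x2` max-pool halves it:

```python
# Sanity check for the input size of the first fully connected layer
size, channels = 64, 3
for out_channels in [32, 64, 128]:
    channels = out_channels   # conv keeps the spatial size, changes the channels
    size = size // 2          # 2x2 max-pool halves the spatial size

print(size, channels, size * size * channels)   # 8 128 8192, i.e. 8*8*128
```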
Let's now look at the forward propagation: 147 | 148 | ```python 149 | def forward(self, s): 150 | # we apply the convolution layers, followed by batch normalisation, 151 | # maxpool and relu x 3 152 | s = self.bn1(self.conv1(s)) # batch_size x 32 x 64 x 64 153 | s = F.relu(F.max_pool2d(s, 2)) # batch_size x 32 x 32 x 32 154 | s = self.bn2(self.conv2(s)) # batch_size x 64 x 32 x 32 155 | s = F.relu(F.max_pool2d(s, 2)) # batch_size x 64 x 16 x 16 156 | s = self.bn3(self.conv3(s)) # batch_size x 128 x 16 x 16 157 | s = F.relu(F.max_pool2d(s, 2)) # batch_size x 128 x 8 x 8 158 | 159 | # flatten the output for each image 160 | s = s.view(-1, 8*8*128) # batch_size x 8*8*128 161 | 162 | # apply 2 fully connected layers with dropout 163 | s = F.dropout(F.relu(self.fcbn1(self.fc1(s))), 164 | p=self.dropout_rate, training=self.training) # batch_size x 128 165 | s = self.fc2(s) # batch_size x 6 166 | 167 | return F.log_softmax(s, dim=1) 168 | ``` 169 | 170 | We pass the image through 3 layers of `conv > bn > max_pool > relu`, followed by flattening the image and then applying 2 fully connected layers. In flattening the output of the convolution layers to a single vector per image, we use `s.view(-1, 8*8*128)`. Here the size `-1` is implicitly inferred from the other dimension (batch size in this case). The output is a log\_softmax over the 6 labels for each example in the batch. We use log\_softmax since it is numerically more stable than first taking the softmax and then the log. 171 | 172 | And that's it! We use an appropriate loss function (Negative Loss Likelihood, since the output is already softmax-ed and log-ed) and train the model as discussed in the [previous][pt-start] post. Remember, you can set a breakpoint using `pdb.set_trace()` at any place in the forward function, examine the dimensions of the Variables, tinker around and diagnose what's going wrong. That's the beauty of PyTorch :). 173 | 174 | ### Resources 175 | - [Data Loading and Processing Tutorial](http://pytorch.org/tutorials/beginner/data_loading_tutorial.html): an official tutorial from the PyTorch website 176 | - [ImageNet](https://github.com/pytorch/examples/blob/master/imagenet/main.py): Code for training on ImageNet in PyTorch 177 | 178 |
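Going back to the training step mentioned just before the resources: a bare-bones loop for this model could pair the `log_softmax` output with `nn.NLLLoss`. This is only a sketch that assumes `model`, `optimizer` and `train_dataloader` have been created as in the snippets above; `train.py` wraps the same logic with logging, metrics and checkpointing.

```python
import torch.nn as nn
from torch.autograd import Variable

# NLLLoss expects log-probabilities, which is exactly what forward() returns
loss_fn = nn.NLLLoss()

for train_batch, labels_batch in train_dataloader:
    train_batch, labels_batch = Variable(train_batch), Variable(labels_batch)

    output_batch = model(train_batch)            # batch_size x 6 log-probabilities
    loss = loss_fn(output_batch, labels_batch)   # scalar loss for the batch

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```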
     
    179 | 180 | --- 181 | 182 |
     
    183 | 184 | That concludes the description of the PyTorch Vision code example. You can proceed to the [NLP][pt-nlp] example to understand how we load data and define models for text. 185 | 186 | 187 | 188 | [github]: https://github.com/cs230-stanford/cs230-code-examples 189 | [post-1]: https://cs230-stanford.github.io/project-code-examples.html 190 | [pt-start]: https://cs230-stanford.github.io/pytorch-getting-started.html 191 | [pt-nlp]: https://cs230-stanford.github.io/pytorch-nlp.html 192 | [course2]: https://www.coursera.org/learn/deep-neural-network -------------------------------------------------------------------------------- /_posts/2018-02-01-pytorch-nlp.md: -------------------------------------------------------------------------------- 1 | --- 2 | layout: post 3 | title: "Named Entity Recognition Tagging" 4 | description: "NER Tagging in PyTorch" 5 | excerpt: "Defining a Recurrent Network and Loading Text Data" 6 | author: "Teaching assistants Surag Nair, Guillaume Genthial, Olivier Moindrot" 7 | date: 2018-01-31 8 | mathjax: true 9 | published: true 10 | tags: pytorch nlp 11 | github: https://github.com/cs230-stanford/cs230-code-examples/tree/master/pytorch/nlp 12 | module: Tutorials 13 | --- 14 | 15 | 16 | 17 | This post follows the [main post][post-1] announcing the CS230 Project Code Examples and the [PyTorch Introduction][pt-start]. In this post, we go through an example from Natural Language Processing, in which we learn how to load text data and perform Named Entity Recognition (NER) tagging for each token. 18 | 19 | 20 | This tutorial is among a series explaining the code examples: 21 | 22 | - [getting started][post-1]: installation, getting started with the code for the projects 23 | - [PyTorch Introduction][pt-start]: global structure of the PyTorch code examples 24 | - [Vision][pt-vision]: predicting labels from images of hand signs 25 | - **this post**: Named Entity Recognition (NER) tagging for sentences 26 | 27 | __Goals of this tutorial__ 28 | - learn how to use PyTorch to load sequential data 29 | - specify a recurrent neural network 30 | - understand the key aspects of the code well-enough to modify it to suit your needs 31 | 32 | __Table of Contents__ 33 | 34 | * TOC 35 | {:toc} 36 | 37 |
     
    38 | 39 | --- 40 |
     
    41 | 42 | ### Problem Setup 43 | 44 | We explore the problem of [Named Entity Recognition](https://en.wikipedia.org/wiki/Named-entity_recognition) (NER) tagging of sentences. The task is to tag each token in a given sentence with an appropriate tag such as Person, Location, etc. 45 | 46 | ``` 47 | John lives in New York 48 | B-PER O O B-LOC I-LOC 49 | ``` 50 | 51 | Our dataset will thus need to load both the sentences and labels. We will store those in 2 different files, a `sentence.txt` file containing the sentences (one per line) and a `labels.txt` containing the labels. For example: 52 | 53 | ``` 54 | # sentences.txt 55 | John lives in New York 56 | Where is John ? 57 | ``` 58 | 59 | ``` 60 | # labels.txt 61 | B-PER O O B-LOC I-LOC 62 | O O B-PER O 63 | ``` 64 | 65 | Here we assume that we ran the `build_vocab.py` script that creates a vocabulary file in our `/data` directory. Running the script gives us one file for the words and one file for the labels. They will contain one token per line. For instance 66 | 67 | ``` 68 | # words.txt 69 | John 70 | lives 71 | in 72 | ... 73 | ``` 74 | 75 | and 76 | 77 | ``` 78 | # tags.txt 79 | B-PER 80 | B-LOC 81 | ... 82 | ``` 83 | 84 | ### Loading the Text Data 85 | 86 | In NLP applications, a sentence is represented by the sequence of indices of the words in the sentence. For example if our vocabulary is `{'is':1, 'John':2, 'Where':3, '.':4, '?':5}` then the sentence "Where is John ?" is represented as `[3,1,2,5]`. We read the `words.txt` file and populate our vocabulary: 87 | 88 | ```python 89 | vocab = {} 90 | with open(words_path) as f: 91 | for i, l in enumerate(f.read().splitlines()): 92 | vocab[l] = i 93 | ``` 94 | 95 | In a similar way, we load a mapping `tag_map` from our labels from `tags.txt` to indices. Doing so gives us indices for labels in the range `[0,1,...,NUM_TAGS-1]`. 96 | 97 | In addition to words read from English sentences, `words.txt` contains two special tokens: an `UNK` token to represent any word that is not present in the vocabulary, and a `PAD` token that is used as a filler token at the end of a sentence when one batch has sentences of unequal lengths. 98 | 99 | We are now ready to load our data. We read the sentences in our dataset (either train, validation or test) and convert them to a sequence of indices by looking up the vocabulary: 100 | 101 | ```python 102 | train_sentences = [] 103 | train_labels = [] 104 | 105 | with open(train_sentences_file) as f: 106 | for sentence in f.read().splitlines(): 107 | # replace each token by its index if it is in vocab 108 | # else use index of UNK 109 | s = [vocab[token] if token in self.vocab 110 | else vocab['UNK'] 111 | for token in sentence.split(' ')] 112 | train_sentences.append(s) 113 | 114 | with open(train_labels_file) as f: 115 | for sentence in f.read().splitlines(): 116 | # replace each label by its index 117 | l = [tag_map[label] for label in sentence.split(' ')] 118 | train_labels.append(l) 119 | ``` 120 | We can load the validation and test data in a similar fashion. 121 | 122 | 123 | ### Preparing a Batch 124 | 125 | This is where it gets fun. When we sample a batch of sentences, not all the sentences usually have the same length. Let's say we have a batch of sentences `batch_sentences` that is a Python list of lists, with its corresponding `batch_tags` which has a tag for each token in `batch_sentences`. 
We convert them into a batch of PyTorch Variables as follows: 126 | 127 | ```python 128 | # compute length of longest sentence in batch 129 | batch_max_len = max([len(s) for s in batch_sentences]) 130 | 131 | # prepare a numpy array with the data, initializing the data with 'PAD' 132 | # and all labels with -1; initializing labels to -1 differentiates tokens 133 | # with tags from 'PAD' tokens 134 | batch_data = vocab['PAD']*np.ones((len(batch_sentences), batch_max_len)) 135 | batch_labels = -1*np.ones((len(batch_sentences), batch_max_len)) 136 | 137 | # copy the data to the numpy array 138 | for j in range(len(batch_sentences)): 139 | cur_len = len(batch_sentences[j]) 140 | batch_data[j][:cur_len] = batch_sentences[j] 141 | batch_labels[j][:cur_len] = batch_tags[j] 142 | 143 | # since all data are indices, we convert them to torch LongTensors 144 | batch_data, batch_labels = torch.LongTensor(batch_data), torch.LongTensor(batch_labels) 145 | 146 | # convert Tensors to Variables 147 | batch_data, batch_labels = Variable(batch_data), Variable(batch_labels) 148 | ``` 149 | 150 | A lot of things happened in the above code. We first calculated the length of the longest sentence in the batch. We then initialized NumPy arrays of dimension `(num_sentences, batch_max_len)` for the sentence and labels, and filled them in from the lists. Since the values are indices (and not floats), PyTorch's Embedding layer expects inputs to be of the `Long` type. We hence convert them to `LongTensor`. 151 | 152 | After filling them in, we observe that the sentences that are shorter than the longest sentence in the batch have the special token `PAD` to fill in the remaining space. Moreover, the `PAD` tokens, introduced as a result of packaging the sentences in a matrix, are assigned a label of -1. Doing so differentiates them from other tokens that have label indices in the range `[0,1,...,NUM_TAGS-1]`. This will be crucial when we calculate the loss for our model's prediction, and we'll come to that in a bit. 153 | 154 | In our code, we package the above code in a custom data\_iterator function. Hyperparameters are stored in a data structure called "params". We can then use the generator as follows: 155 | ```python 156 | # train_data contains train_sentences and train_labels 157 | # params contains batch_size 158 | train_iterator = data_iterator(train_data, params, shuffle=True) 159 | 160 | for _ in range(num_training_steps): 161 | batch_sentences, batch_labels = next(train_iterator) 162 | 163 | # pass through model, perform backpropagation and updates 164 | output_batch = model(train_batch) 165 | ... 166 | ``` 167 | 168 | ### Recurrent Network Model 169 | 170 | Now that we have figured out how to load our sentences and tags, let's have a look at the Recurrent Neural Network model. As mentioned in the [previous][pt-start] post, we first define the components of our model, followed by its functional form. 
Let's have a look at the `__init__` function for our model that takes in `(batch_size, batch_max_len)` dimensional data: 171 | 172 | ```python 173 | import torch.nn as nn 174 | import torch.nn.functional as F 175 | 176 | class Net(nn.Module): 177 | def __init__(self, params): 178 | super(Net, self).__init__() 179 | 180 | # maps each token to an embedding_dim vector 181 | self.embedding = nn.Embedding(params.vocab_size, params.embedding_dim) 182 | 183 | # the LSTM takens embedded sentence 184 | self.lstm = nn.LSTM(params.embedding_dim, params.lstm_hidden_dim, batch_first=True) 185 | 186 | # fc layer transforms the output to give the final output layer 187 | self.fc = nn.Linear(params.lstm_hidden_dim, params.number_of_tags) 188 | ``` 189 | 190 | We use an LSTM for the recurrent network. Before running the LSTM, we first transform each word in our sentence to a vector of dimension `embedding_dim`. We then run the LSTM over this sentence. Finally, we have a fully connected layer that transforms the output of the LSTM for each token to a distribution over tags. This is implemented in the forward propagation function: 191 | 192 | ```python 193 | def forward(self, s): 194 | # apply the embedding layer that maps each token to its embedding 195 | s = self.embedding(s) # dim: batch_size x batch_max_len x embedding_dim 196 | 197 | # run the LSTM along the sentences of length batch_max_len 198 | s, _ = self.lstm(s) # dim: batch_size x batch_max_len x lstm_hidden_dim 199 | 200 | # reshape the Variable so that each row contains one token 201 | s = s.view(-1, s.shape[2]) # dim: batch_size*batch_max_len x lstm_hidden_dim 202 | 203 | # apply the fully connected layer and obtain the output for each token 204 | s = self.fc(s) # dim: batch_size*batch_max_len x num_tags 205 | 206 | return F.log_softmax(s, dim=1) # dim: batch_size*batch_max_len x num_tags 207 | ``` 208 | 209 | The embedding layer augments an extra dimension to our input which then has shape `(batch_size, batch_max_len, embedding_dim)`. We run it through the LSTM which gives an output for each token of length `lstm_hidden_dim`. In the next step, we open up the 3D Variable and reshape it such that we get the hidden state for each token, i.e. the new dimension is `(batch_size*batch_max_len, lstm_hidden_dim)`. Here the `-1` is implicitly inferred to be equal to `batch_size*batch_max_len`. The reason behind this reshaping is that the fully connected layer assumes a 2D input, with one example along each row. 210 | 211 | After the reshaping, we apply the fully connected layer which gives a vector of `NUM_TAGS` for each token in each sentence. The output is a log\_softmax over the tags for each token. We use log\_softmax since it is numerically more stable than first taking the softmax and then the log. 212 | 213 | All that is left is to compute the loss. But there's a catch- we can't use a `torch.nn.loss` function straight out of the box because that would add the loss from the `PAD` tokens as well. Here's where the power of PyTorch comes into play- we can write our own custom loss function! 214 | 215 | ### Writing a Custom Loss Function 216 | 217 | In the [section](#batch) on preparing batches, we ensured that the labels for the `PAD` tokens were set to `-1`. We can leverage this to filter out the `PAD` tokens when we compute the loss. 
Let us see how: 218 | ```python 219 | def loss_fn(outputs, labels): 220 | # reshape labels to give a flat vector of length batch_size*seq_len 221 | labels = labels.view(-1) 222 | 223 | # mask out 'PAD' tokens 224 | mask = (labels >= 0).float() 225 | 226 | # the number of tokens is the sum of elements in mask 227 | num_tokens = int(torch.sum(mask).data[0]) 228 | 229 | # pick the values corresponding to labels and multiply by mask 230 | outputs = outputs[range(outputs.shape[0]), labels]*mask 231 | 232 | # cross entropy loss for all non 'PAD' tokens 233 | return -torch.sum(outputs)/num_tokens 234 | ``` 235 | The input labels has dimension `(batch_size, batch_max_len)` while outputs has dimension `(batch_size*batch_max_len, NUM_TAGS)`. We compute a mask using the fact that all `PAD` tokens in `labels` have the value `-1`. We then compute the Negative Log Likelihood Loss (remember the output from the network is already softmax-ed and log-ed!) for all the non `PAD` tokens. We can now compute derivates by simply calling `.backward()` on the loss returned by this function. 236 | 237 | Remember, you can set a breakpoint using `pdb.set_trace()` at any place in the forward function, loss function or virtually anywhere and examine the dimensions of the Variables, tinker around and diagnose what's going wrong. That's the beauty of PyTorch :). 238 | 239 | ### Resources 240 | - [Generating Names](http://pytorch.org/tutorials/intermediate/char_rnn_generation_tutorial.html#sphx-glr-intermediate-char-rnn-generation-tutorial-py): a tutorial on character-level RNN 241 | - [Sequence to Sequence models](http://pytorch.org/tutorials/intermediate/seq2seq_translation_tutorial.html#sphx-glr-intermediate-seq2seq-translation-tutorial-py): a tutorial on translation 242 | 243 |
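For completeness, the custom `data_iterator` mentioned in the batching section could be sketched roughly as follows. This is illustrative only: it assumes `train_data` is a dict holding the `train_sentences` and `train_labels` lists built earlier (under the keys `'data'` and `'labels'`), that `vocab` is the word-to-index mapping from above, and that `params.batch_size` is set; the version in the repository differs in its details.

```python
import random

import numpy as np
import torch
from torch.autograd import Variable

def data_iterator(train_data, params, shuffle=False):
    """Yield one (batch_data, batch_labels) pair of Variables at a time (sketch only)."""
    order = list(range(len(train_data['data'])))
    if shuffle:
        random.shuffle(order)

    for i in range(len(order) // params.batch_size):
        batch_idx = order[i * params.batch_size:(i + 1) * params.batch_size]
        batch_sentences = [train_data['data'][j] for j in batch_idx]
        batch_tags = [train_data['labels'][j] for j in batch_idx]

        # pad with 'PAD' / -1 exactly as in the "Preparing a Batch" section
        batch_max_len = max(len(s) for s in batch_sentences)
        batch_data = vocab['PAD'] * np.ones((len(batch_sentences), batch_max_len))
        batch_labels = -1 * np.ones((len(batch_sentences), batch_max_len))
        for j in range(len(batch_sentences)):
            cur_len = len(batch_sentences[j])
            batch_data[j][:cur_len] = batch_sentences[j]
            batch_labels[j][:cur_len] = batch_tags[j]

        yield (Variable(torch.LongTensor(batch_data)),
               Variable(torch.LongTensor(batch_labels)))
```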
     
    244 | 245 | --- 246 | 247 |
     
    248 | 249 | That concludes the description of the PyTorch NLP code example. If you haven't, take a look at the [Vision][pt-vision] example to understand how we load data and define models for images 250 | 251 | 252 | 253 | [github]: https://github.com/cs230-stanford/cs230-code-examples 254 | [post-1]: https://cs230-stanford.github.io/project-code-examples.html 255 | [pt-start]: https://cs230-stanford.github.io/pytorch-getting-started.html 256 | [pt-vision]: https://cs230-stanford.github.io/pytorch-vision.html 257 | -------------------------------------------------------------------------------- /_posts/2018-02-01-pytorch-getting-started.md: -------------------------------------------------------------------------------- 1 | --- 2 | layout: post 3 | title: "Introduction to PyTorch Code Examples" 4 | description: "Tutorial for the PyTorch Code Examples" 5 | excerpt: "An overview of training, models, loss functions and optimizers" 6 | author: "Teaching assistants Surag Nair, Guillaume Genthial, Olivier Moindrot" 7 | date: 2018-01-31 8 | mathjax: true 9 | published: true 10 | tags: pytorch 11 | github: https://github.com/cs230-stanford/cs230-code-examples/tree/master/pytorch 12 | module: Tutorials 13 | --- 14 | 15 | 16 | 17 | This post follows the [main post][post-1] announcing the CS230 Project Code Examples. 18 | Here we explain some details of the PyTorch part of the code from our [github repository][github]. 19 | 20 | ``` 21 | pytorch/ 22 | vision/ 23 | nlp/ 24 | ``` 25 | 26 | This tutorial is among a series explaining the code examples: 27 | 28 | - [getting started][post-1]: installation, getting started with the code for the projects 29 | - **this post**: global structure of the PyTorch code 30 | - [Vision][pt-vision]: predicting labels from images of hand signs 31 | - [NLP][pt-nlp]: Named Entity Recognition (NER) tagging for sentences 32 | 33 | __Goals of this tutorial__ 34 | - learn more about PyTorch 35 | - learn an example of how to correctly structure a deep learning project in PyTorch 36 | - understand the key aspects of the code well-enough to modify it to suit your needs 37 | 38 | __Table of Contents__ 39 | 40 | * TOC 41 | {:toc} 42 | 43 |
     
    44 | 45 | --- 46 |
     
    47 | ### Resources 48 | 49 | 50 | - The main PyTorch [homepage](http://pytorch.org/). 51 | - The [official tutorials](http://pytorch.org/tutorials/) cover a wide variety of use cases- attention based sequence to sequence models, Deep Q-Networks, neural transfer and much more! 52 | - A quick [crash course](http://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html) in PyTorch. 53 | - Justin Johnson's [repository](https://github.com/jcjohnson/pytorch-examples) that introduces fundamental PyTorch concepts through self-contained examples. 54 | - Tons of resources in this [list](https://github.com/ritchieng/the-incredible-pytorch). 55 | 56 | ### Code Layout 57 | 58 | The code for each PyTorch example (Vision and NLP) shares a common structure: 59 | ``` 60 | data/ 61 | experiments/ 62 | model/ 63 | net.py 64 | data_loader.py 65 | train.py 66 | evaluate.py 67 | search_hyperparams.py 68 | synthesize_results.py 69 | evaluate.py 70 | utils.py 71 | ``` 72 | 73 | - `model/net.py`: specifies the neural network architecture, the loss function and evaluation metrics 74 | - `model/data_loader.py`: specifies how the data should be fed to the network 75 | - `train.py`: contains the main training loop 76 | - `evaluate.py`: contains the main loop for evaluating the model 77 | - `utils.py`: utility functions for handling hyperparams/logging/storing model 78 | 79 | We recommend reading through `train.py` to get a high-level overview. 80 | 81 | Once you get the high-level idea, depending on your task and dataset, you might want to modify 82 | - `model/net.py` to change the model, i.e. how you transform your input into your prediction as well as your loss, etc. 83 | - `model/data_loader.py` to change the way you feed data to the model. 84 | - `train.py` and `evaluate.py` to make changes specific to your problem, if required 85 | 86 | Once you get something working for your dataset, feel free to edit any part of the code to suit your own needs. 87 | 88 | ### Tensors and Variables 89 | 90 | Before going further, I strongly suggest you go through this [60 Minute Blitz with PyTorch](http://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html) to gain an understanding of PyTorch basics. Here's a sneak peak. 91 | 92 | PyTorch Tensors are similar in behaviour to NumPy’s arrays. 93 | ```python 94 | >>> import torch 95 | >>> a = torch.Tensor([[1,2],[3,4]]) 96 | >>> print(a) 97 | 1 2 98 | 3 4 99 | [torch.FloatTensor of size 2x2] 100 | 101 | >>> print(a**2) 102 | 1 4 103 | 9 16 104 | [torch.FloatTensor of size 2x2] 105 | ```` 106 | 107 | PyTorch Variables allow you to wrap a Tensor and record operations performed on it. This allows you to perform automatic differentiation. 108 | 109 | ```python 110 | >>> from torch.autograd import Variable 111 | >>> a = Variable(torch.Tensor([[1,2],[3,4]]), requires_grad=True) 112 | >>> print(a) 113 | Variable containing: 114 | 1 2 115 | 3 4 116 | [torch.FloatTensor of size 2x2] 117 | 118 | >>> y = torch.sum(a**2) # 1 + 4 + 9 + 16 119 | >>> print(y) 120 | Variable containing: 121 | 30 122 | [torch.FloatTensor of size 1] 123 | 124 | >>> y.backward() # compute gradients of y wrt a 125 | >>> print(a.grad) # print dy/da_ij = 2*a_ij for a_11, a_12, a21, a22 126 | Variable containing: 127 | 2 4 128 | 6 8 129 | [torch.FloatTensor of size 2x2] 130 | ``` 131 | 132 | This prelude should give you a sense of the things to come. PyTorch packs elegance and expressiveness in its minimalist and intuitive syntax. 
Familiarize yourself with some more examples from the [Resources](#resources) section before moving ahead. 133 | 134 | ### Core Training Step 135 | 136 | Let's begin with a look at what the heart of our training algorithm looks like. The five lines below pass a batch of inputs through the model, calculate the loss, perform backpropagation and update the parameters. 137 | 138 | ```python 139 | output_batch = model(train_batch) # compute model output 140 | loss = loss_fn(output_batch, labels_batch) # calculate loss 141 | 142 | optimizer.zero_grad() # clear previous gradients 143 | loss.backward() # compute gradients of all variables wrt loss 144 | 145 | optimizer.step() # perform updates using calculated gradients 146 | ```` 147 | 148 | Each of the variables `train_batch`, `labels_batch`, `output_batch` and `loss` is a PyTorch Variable and allows derivates to be automatically calculated. 149 | 150 | All the other code that we write is built around this- the exact specification of the model, how to fetch a batch of data and labels, computation of the loss and the details of the optimizer. In this post, we'll cover how to write a simple model in PyTorch, compute the loss and define an optimizer. The subsequent posts each cover a case of fetching data- one for image data and another for text data. 151 | 152 | ### Models in PyTorch 153 | A model can be defined in PyTorch by subclassing the `torch.nn.Module` class. The model is defined in two steps. We first specify the parameters of the model, and then outline how they are applied to the inputs. For operations that do not involve trainable parameters (activation functions such as ReLU, operations like maxpool), we generally use the `torch.nn.functional` module. Here's an example of a single hidden layer neural network borrowed from [here](https://github.com/jcjohnson/pytorch-examples#pytorch-custom-nn-modules): 154 | 155 | ```python 156 | import torch.nn as nn 157 | import torch.nn.functional as F 158 | 159 | class TwoLayerNet(nn.Module): 160 | def __init__(self, D_in, H, D_out): 161 | """ 162 | In the constructor we instantiate two nn.Linear modules and assign them as 163 | member variables. 164 | 165 | D_in: input dimension 166 | H: dimension of hidden layer 167 | D_out: output dimension 168 | """ 169 | super(TwoLayerNet, self).__init__() 170 | self.linear1 = nn.Linear(D_in, H) 171 | self.linear2 = nn.Linear(H, D_out) 172 | 173 | def forward(self, x): 174 | """ 175 | In the forward function we accept a Variable of input data and we must 176 | return a Variable of output data. We can use Modules defined in the 177 | constructor as well as arbitrary operators on Variables. 178 | """ 179 | h_relu = F.relu(self.linear1(x)) 180 | y_pred = self.linear2(h_relu) 181 | return y_pred 182 | ``` 183 | 184 | The `__init__` function initialises the two linear layers of the model. PyTorch takes care of the proper initialization of the parameters you specify. In the `forward` function, we first apply the first linear layer, apply ReLU activation and then apply the second linear layer. The module assumes that the first dimension of `x` is the batch size. If the input to the network is simply a vector of dimension 100, and the batch size is 32, then the dimension of `x` would be 32,100. Let's see an example of how to define a model and compute a forward pass: 185 | 186 | ```python 187 | # N is batch size; D_in is input dimension; 188 | # H is the dimension of the hidden layer; D_out is output dimension. 
189 | N, D_in, H, D_out = 32, 100, 50, 10 190 | 191 | # Create random Tensors to hold inputs and outputs, and wrap them in Variables 192 | x = Variable(torch.randn(N, D_in)) # dim: 32 x 100 193 | 194 | # Construct our model by instantiating the class defined above 195 | model = TwoLayerNet(D_in, H, D_out) 196 | 197 | # Forward pass: Compute predicted y by passing x to the model 198 | y_pred = model(x) # dim: 32 x 10 199 | ``` 200 | More complex models follow the same layout, and we'll see two of them in the subsequent posts. 201 | 202 | ### Loss Function 203 | 204 | 205 | PyTorch comes with many standard loss functions available for you to use in the `torch.nn` [module](http://pytorch.org/docs/master/nn.html#loss-functions). Here's a simple example of how to calculate Cross Entropy Loss. Let's say our model solves a multi-class classification problem with `C` labels. Then for a batch of size `N`, `out` is a PyTorch Variable of dimension `NxC` that is obtained by passing an input batch through the model. We also have a `target` Variable of size `N`, where each element is the class for that example, i.e. a label in `[0,...,C-1]`. You can define the loss function and compute the loss as follows: 206 | 207 | 208 | ```python 209 | loss_fn = nn.CrossEntropyLoss() 210 | loss = loss_fn(out, target) 211 | ``` 212 | 213 | PyTorch makes it very easy to extend this and write your own custom loss function. We can write our own Cross Entropy Loss function as below (note the NumPy-esque syntax): 214 | ```python 215 | def myCrossEntropyLoss(outputs, labels): 216 | batch_size = outputs.size()[0] # batch_size 217 | outputs = F.log_softmax(outputs, dim=1) # compute the log of softmax values 218 | outputs = outputs[range(batch_size), labels] # pick the values corresponding to the labels 219 | return -torch.sum(outputs)/num_examples 220 | ``` 221 | 222 | This was a fairly simple example of writing our own loss function. In the section on [NLP][pt-nlp], we'll see an interesting use of custom loss functions. 223 | 224 | ### Optimizer 225 | 226 | The `torch.optim` [package](http://pytorch.org/docs/master/optim.html) provides an easy to use interface for common optimization algorithms. Defining your optimizer is really as simple as: 227 | 228 | ```python 229 | # pick an SGD optimizer 230 | optimizer = torch.optim.SGD(model.parameters(), lr = 0.01, momentum=0.9) 231 | 232 | # or pick ADAM 233 | optimizer = torch.optim.Adam(model.parameters(), lr = 0.0001) 234 | ``` 235 | 236 | You pass in the parameters of the model that need to be updated every iteration. You can also specify more complex methods such as per-layer or even per-parameter learning rates. 237 | 238 | Once gradients have been computed using `loss.backward()`, calling `optimizer.step()` updates the parameters as defined by the optimization algorithm. 239 | 240 | ### Training vs Evaluation 241 | 242 | Before training the model, it is imperative to call `model.train()`. Likewise, you must call `model.eval()` before testing the model. This corrects for the differences in dropout, batch normalization during training and testing. 243 | 244 | ### Computing Metrics 245 | By this stage you should be able to understand most of the code in `train.py` and `evaluate.py` (except how we fetch the data, which we'll come to in the subsequent posts). Apart from keeping an eye on the loss, it is also helpful to monitor other metrics such as accuracy and precision/recall. 
To do this, you can define your own metric functions for a batch of model outputs in the `model/net.py` file. In order to make it easier, we convert the PyTorch Variables into NumPy arrays before passing them into the metric functions. For a multi-class classification problem as set up in the section on [Loss Function](#lossfunc), we can write a function to compute accuracy using NumPy as: 246 | 247 | ```python 248 | def accuracy(out, labels): 249 | outputs = np.argmax(out, axis=1) 250 | return np.sum(outputs==labels)/float(labels.size) 251 | ``` 252 | 253 | You can add your own metrics in the `model/net.py` file. Once you are done, simply add them to the `metrics` dictionary: 254 | ```python 255 | metrics = { 'accuracy': accuracy, 256 | # add your own custom metrics, 257 | } 258 | ``` 259 | 260 | ### Saving and Loading Models 261 | 262 | We define utility functions to save and load models in `utils.py`. To save your model, call: 263 | ```python 264 | state = {'epoch': epoch + 1, 265 | 'state_dict': model.state_dict(), 266 | 'optim_dict' : optimizer.state_dict()} 267 | utils.save_checkpoint(state, 268 | is_best=is_best, # True if this is the model with best metrics 269 | checkpoint=model_dir) # path to folder 270 | ``` 271 | 272 | `utils.py` internally uses the `torch.save(state, filepath)` method to save the state dictionary that is defined above. You can add more items to the dictionary, such as metrics. The `model.state_dict()` stores the parameters of the model and `optimizer.state_dict()` stores the state of the optimizer (such as per-parameter learning rate). 273 | 274 | To load the saved state from a checkpoint, you may use: 275 | ```python 276 | utils.load_checkpoint(restore_path, model, optimizer) 277 | ``` 278 | 279 | The `optimizer` argument is optional and you may choose to restart with a new optimizer. `load_checkpoint` internally loads the saved checkpoint and restores the model weights and the state of the optimizer. 280 | 281 | ### Using the GPU 282 | 283 | Interspersed through the code you will find lines such as: 284 | ```python 285 | > model = net.Net(params).cuda() if params.cuda else net.Net(params) 286 | 287 | > if params.cuda: 288 | batch_data, batch_labels = batch_data.cuda(), batch_labels.cuda() 289 | ``` 290 | 291 | PyTorch makes the use of the GPU explicit and transparent using these commands. Calling `.cuda()` on a model/Tensor/Variable sends it to the GPU. In order to train a model on the GPU, all the relevant parameters and Variables must be sent to the GPU using `.cuda()`. 292 | 293 | ### Painless Debugging 294 | 295 | With its clean and minimal design, PyTorch makes debugging a breeze. You can place breakpoints using `pdb.set_trace()` at any line in your code. You can then execute further computations, examine the PyTorch Tensors/Variables and pinpoint the root cause of the error. 296 | 297 | 298 |
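To see how the pieces above fit together, here is a minimal sketch of one training step that reuses the `TwoLayerNet`, the Cross Entropy loss, the SGD optimizer and the `accuracy` metric from this post. The random batch is purely illustrative; in the code examples the batches come from the data loaders described in the next posts.

```python
import torch
import torch.nn as nn
from torch.autograd import Variable

model = TwoLayerNet(D_in=100, H=50, D_out=10)     # model class defined above
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

# A fake batch of 32 examples, only for illustration
train_batch = Variable(torch.randn(32, 100))                   # dim: 32 x 100
labels_batch = Variable(torch.LongTensor(32).random_(0, 10))   # labels in [0, 9]

model.train()                                # training mode (dropout, batch norm)
output_batch = model(train_batch)            # forward pass, dim: 32 x 10
loss = loss_fn(output_batch, labels_batch)   # compute the loss

optimizer.zero_grad()                        # clear previous gradients
loss.backward()                              # compute gradients of all variables wrt loss
optimizer.step()                             # perform updates using calculated gradients

# Monitor a metric, converting Variables to NumPy arrays as in model/net.py
acc = accuracy(output_batch.data.numpy(), labels_batch.data.numpy())
```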
     
---
     
    303 | 304 | That concludes the introduction to the PyTorch code examples. You can proceed to the [Vision][pt-vision] example and/or the [NLP][pt-nlp] example to understand how we load data and define models specific to each domain. 305 | 306 | 307 | 308 | [github]: https://github.com/cs230-stanford/cs230-code-examples 309 | [post-1]: https://cs230-stanford.github.io/project-code-examples.html 310 | [pt-vision]: https://cs230-stanford.github.io/pytorch-vision.html 311 | [pt-nlp]: https://cs230-stanford.github.io/pytorch-nlp.html 312 | -------------------------------------------------------------------------------- /_posts/2018-02-01-project-code-examples.md: -------------------------------------------------------------------------------- 1 | --- 2 | layout: post 3 | title: "Introducing the Project Code Examples" 4 | description: "Introduction and installation" 5 | excerpt: "Introduction and installation" 6 | author: "Teaching assistants Guillaume Genthial, Olivier Moindrot, Surag Nair" 7 | date: 2018-01-24 8 | mathjax: true 9 | published: true 10 | tags: tensorflow pytorch 11 | github: https://github.com/cs230-stanford/cs230-code-examples 12 | module: Tutorials 13 | --- 14 | 15 | We are happy to introduce the project code examples for CS230. All the code used in the tutorial can be found on the corresponding [github repository][github]. The code has been well commented and detailed, so we recommend reading it entirely at some point if you want to use it for your project. 16 | 17 | The code contains examples for TensorFlow and PyTorch, in vision and NLP. The structure of the repository is the following: 18 | ``` 19 | README.md 20 | pytorch/ 21 | vision/ 22 | nlp/ 23 | tensorflow/ 24 | vision/ 25 | nlp/ 26 | ``` 27 | 28 | This post will help you familiarize with the Project Code Examples, and introduces a series of posts explaining how to structure a deep learning project: 29 | 30 | #### Tensorflow 31 | - [second post][tf-post]: introduction to Tensorflow 32 | - [third post][tf-data]: how to build the data pipeline with tf.data 33 | - [fourth post][tf-model]: how to create and train a model 34 | 35 | #### PyTorch 36 | - [second post][pt-post]: introduction to PyTorch 37 | - [third post][pt-vision]: Vision- predicting labels from images of hand signs 38 | - [fourth post][pt-nlp]: NLP- Named Entity Recognition (NER) tagging for sentences 39 | 40 | __Goals of the code examples__ 41 | 42 | - through these code examples, explain and demonstrate the best practices for structuring a deep learning project 43 | - help students kickstart their project with a working codebase 44 | - in each tensorflow and pytorch, give two examples of projects: one for a vision task, one for a NLP task 45 | 46 | __Table of Content__ 47 | 48 | * TOC 49 | {:toc} 50 | 51 | 52 | --- 53 | 54 | ## Installation 55 | 56 | Each of the four examples (TensorFlow / PyTorch + Vision / NLP) is self-contained and can be used independently of the others. 57 | 58 | Suppose you want to work with TensorFlow on a project involving computer vision. You can first clone the whole github repository and only keep the `tensorflow/vision` folder: 59 | 60 | ```bash 61 | git clone https://github.com/cs230-stanford/cs230-code-examples 62 | cd cs230-code-examples/tensorflow/vision 63 | ``` 64 | 65 | ### Create your virtual environment 66 | It is a good practice to have multiple virtual environments to work on different projects. Here we will use `python3` and install the requirements in the file `requirements.txt`. 
67 | 68 | **Installing Python 3**: To use `python3`, make sure to install version 3.5 or 3.6 on your local machine. 69 | If you are on Mac OS X, you can do this using [Homebrew](https://brew.sh) with `brew install python3`. You can find instructions for Ubuntu [here](https://www.digitalocean.com/community/tutorials/how-to-install-python-3-and-set-up-a-local-programming-environment-on-ubuntu-16-04). 70 | 71 | **Virtual environment**: If we don't have it already, install `virtualenv` by typing `sudo pip install virtualenv` (or `pip install --user virtualenv` if you don't have sudo) in your terminal. 72 | Here we create a virtual environment named `.env`. __Navigate inside each example repo and run the following command __ for instance in `tensorflow/nlp` 73 | ```bash 74 | virtualenv -p python3 .env 75 | source .env/bin/activate 76 | pip install -r requirements.txt 77 | ``` 78 | 79 | Run `deactivate` if you want to leave the virtual environment. Next time you want to work on the project, just re-run `source .env/bin/activate` after navigating to the correct directory. 80 | 81 | ### If you have a GPU 82 | 83 | 84 | - for tensorflow, just run `pip install tensorflow-gpu`. When both `tensorflow` and `tensorflow-gpu` are installed, if a GPU is available, `tensorflow` will automatically use it, making it transparent for you to use. 85 | - for PyTorch, follow the instructions [here](http://www.pytorch.org). 86 | 87 | Note that your GPU needs to be set up first (drivers, CUDA and CuDNN). 88 | 89 | 90 | ### Download the data 91 | 92 | __You'll find descriptions of the tasks__ in [`tensorflow/vision/README.md`](https://github.com/cs230-stanford/cs230-code-examples/blob/master/tensorflow/vision/README.md), [`tensorflow/nlp/README.md`](https://github.com/cs230-stanford/cs230-code-examples/blob/master/tensorflow/nlp/README.md) etc. 93 | 94 | #### Vision 95 | 96 | _All instructions can be found in the [`tensorflow/vision/README.md`](https://github.com/cs230-stanford/cs230-code-examples/blob/master/tensorflow/vision/README.md)_ 97 | 98 | For the vision example, we will used the SIGNS dataset created for the Deep Learning Specialization. The dataset is hosted on google drive, download it [here][SIGNS]. 99 | 100 | This will download the SIGNS dataset (~1.1 GB) containing photos of hands signs representing numbers between 0 and 5. 101 | Here is the structure of the data: 102 | ``` 103 | SIGNS/ 104 | train_signs/ 105 | 0_IMG_5864.jpg 106 | ... 107 | test_signs/ 108 | 0_IMG_5942.jpg 109 | ... 110 | ``` 111 | 112 | The images are named following `{label}_IMG_{id}.jpg` where the label is in `[0, 5]`. 113 | The training set contains 1,080 images and the test set contains 120 images. 114 | 115 | Once the download is complete, move the dataset into the `data/SIGNS` folder. Run the script ` python build_dataset.py` which will resize the images to size `(64, 64)`. The new resized dataset will be located by default in `data/64x64_SIGNS`. 116 | 117 | 118 | #### Natural Language Processing (NLP) 119 | 120 | *All instructions can be found in the [`tensorflow/nlp/README.md`](https://github.com/cs230-stanford/cs230-code-examples/blob/master/tensorflow/nlp/README.md)* 121 | 122 | We provide a small subset of the kaggle dataset (30 sentences) for testing in `data/small` but you are encouraged to download the original version on the [Kaggle](https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus/data) website. 123 | 124 | 1. 
__Download the dataset__ `ner_dataset.csv` on [Kaggle](https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus/data) and save it under the `nlp/data/kaggle` directory. Make sure you download the simple version `ner_dataset.csv` and NOT the full version `ner.csv`. 125 | 126 | 2. __Build the dataset__ Run the following script 127 | ``` 128 | python build_kaggle_dataset.py 129 | ``` 130 | It will extract the sentences and labels from the dataset, split it into train / test / dev and save it in a convenient format for our model. Here is the structure of the data 131 | ``` 132 | kaggle/ 133 | train/ 134 | sentences.txt 135 | labels.txt 136 | test/ 137 | sentences.txt 138 | labels.txt 139 | dev/ 140 | sentences.txt 141 | labels.txt 142 | ``` 143 | *Debug* If you get some errors, check that you downloaded the right file and saved it in the right directory. If you have issues with encoding, try running the script with python 2.7. 144 | 145 | 3. __Build the vocabulary__ For both datasets, `data/small` and `data/kaggle` you need to build the vocabulary, with 146 | ``` 147 | python build_vocab.py --data_dir data/small 148 | ``` 149 | or 150 | ``` 151 | python build_vocab.py --data_dir data/kaggle 152 | ``` 153 | 154 | --- 155 | 156 | ## Structure of the code 157 | 158 | The code for each example shares a common structure: 159 | ``` 160 | data/ 161 | train/ 162 | dev/ 163 | test/ 164 | experiments/ 165 | model/ 166 | *.py 167 | build_dataset.py 168 | train.py 169 | search_hyperparams.py 170 | synthesize_results.py 171 | evaluate.py 172 | ``` 173 | 174 | Here is each file or directory's purpose: 175 | - `data/`: will contain all the data of the project (generally not stored on github), with an explicit train/dev/test split 176 | - `experiments`: contains the different experiments (will be explained in the following section) 177 | - `model/`: module defining the model and functions used in train or eval. Different for our PyTorch and TensorFlow examples 178 | 179 | 180 | 181 | 182 | 183 | - `build_dataset.py`: creates or transforms the dataset, build the split into train/dev/test 184 | - `train.py`: train the model on the input data, and evaluate each epoch on the dev set 185 | - `search_hyperparams.py`: run `train.py` multiple times with different hyperparameters 186 | - `synthesize_results.py`: explore different experiments in a directory and display a nice table of the results 187 | - `evaluate.py`: evaluate the model on the test set (should be run once at the end of your project) 188 | 189 | --- 190 | 191 | ## Running experiments 192 | 193 | 194 | 195 | Now that you have understood the structure of the code, we can try to train a model on the data, using the `train.py` script: 196 | ```bash 197 | python train.py --model_dir experiments/base_model 198 | ``` 199 | 200 | We need to pass the model directory in argument, where the hyperparameters are stored in a json file named `params.json`. 201 | Different experiments will be stored in different directories, each with their own `params.json` file. Here is an example: 202 | 203 | `experiments/base_model/params.json`: 204 | ```json 205 | { 206 | "learning_rate": 1e-3, 207 | "batch_size": 32, 208 | "num_epochs": 20 209 | } 210 | ``` 211 | 212 | The structure of `experiments` after running a few different models might look like this (try to give meaningful names to the directories depending on what experiment you are running): 213 | ``` 214 | experiments/ 215 | base_model/ 216 | params.json 217 | ... 
218 | learning_rate/ 219 | lr_0.1/ 220 | params.json 221 | lr_0.01/ 222 | params.json 223 | batch_norm/ 224 | params.json 225 | ``` 226 | 227 | Each directory after training will contain multiple things: 228 | - `params.json`: the list of hyperparameters, in json format 229 | - `train.log`: the training log (everything we print to the console) 230 | - `train_summaries`: train summaries for TensorBoard (TensorFlow only) 231 | - `eval_summaries`: eval summaries for TensorBoard (TensorFlow only) 232 | - `last_weights`: weights saved from the 5 last epochs 233 | - `best_weights`: best weights (based on dev accuracy) 234 | 235 | 236 | ### Training and evaluation 237 | 238 | We can now train an example model with the parameters provided in the configuration file `experiments/base_model/params.json`: 239 | ```bash 240 | python train.py --model_dir experiments/base_model 241 | ``` 242 | 243 | The console output will look like 244 | 245 | {% include image.html url="/assets/project-code-examples/training.png" description="Training" size="60%" %} 246 | 247 | 248 | 249 | Once training is done, we can evaluate on the test set: 250 | ```bash 251 | python evaluate.py --model_dir experiments/base_model 252 | ``` 253 | 254 | This was just a quick example, so please refer to the detailed [TensorFlow][tf-post] / [PyTorch][pt-post] tutorials for an in-depth explanation of the code. 255 | 256 | 257 | ### Hyperparameters search 258 | 259 | We provide an example that will call `train.py` with different values of learning rate. We first create a directory 260 | ``` 261 | experiments/ 262 | learning_rate/ 263 | params.json 264 | ``` 265 | 266 | with a `params.json` file that contains the other hyperparameters. Then, by calling 267 | 268 | 269 | ``` 270 | python search_hyperparams.py --parent_dir experiments/learning_rate 271 | ``` 272 | 273 | It will train and evaluate a model with different values of learning rate defined in `search_hyperparams.py` and create a new directory for each experiment under `experiments/learning_rate/`, like 274 | 275 | ``` 276 | experiments/ 277 | learning_rate/ 278 | learning_rate_0.001/ 279 | metrics_eval_best_weights.json 280 | learning_rate_0.01/ 281 | metrics_eval_best_weights.json 282 | ... 283 | ``` 284 | 285 | ### Display the results of multiple experiments 286 | 287 | If you want to aggregate the metrics computed in each experiment (the `metrics_eval_best_weights.json` files), simply run 288 | 289 | ``` 290 | python synthesize_results.py --parent_dir experiments/learning_rate 291 | ``` 292 | 293 | It will display a table synthesizing the results like this that is compatible with markdown: 294 | 295 | ``` 296 | | | accuracy | loss | 297 | |:----------------------------------------------|-----------:|----------:| 298 | | experiments/base_model | 0.989 | 0.0550 | 299 | | experiments/learning_rate/learning_rate_0.01 | 0.939 | 0.0324 | 300 | | experiments/learning_rate/learning_rate_0.001 | 0.979 | 0.0623 | 301 | ``` 302 | 303 | --- 304 | 305 | ## Tensorflow or PyTorch ? 306 | 307 | Both framework have their pros and cons: 308 | 309 | __Tensorflow__ 310 | - mature, most of the models and layers are already implemented in the library. 
311 | - documented and plenty of code / tutorials online 312 | - the Deep Learning Specialization teaches you how to use Tensorflow 313 | - built for large-scale deployment and used by a lot of companies 314 | - has some very useful tools like Tensorboard for visualization (though you can also use [Tensorboard with PyTorch](https://github.com/lanpa/tensorboard-pytorch)) 315 | - but some ramp-up time is needed to understand some of the concepts (session, graph, variable scope, etc.) -- *(reason why we have code examples that take care of these subtleties)* 316 | - transparent use of the GPU 317 | - can be harder to debug 318 | 319 | __PyTorch__ 320 | - younger, but also well documented and fast-growing community 321 | - more pythonic and numpy-like approach, easier to get used to the dynamic-graph paradigm 322 | - designed for faster prototyping and research 323 | - transparent use of the GPU 324 | - easy to debug and customize 325 | 326 | 327 | Which one will you [choose][matrix] ? 328 | 329 |
PyTorch

Tensorflow
    345 | 346 | 347 | [github]: https://github.com/cs230-stanford/cs230-code-examples 348 | [tf-post]: https://cs230-stanford.github.io/tensorflow-getting-started.html 349 | [pt-post]: https://cs230-stanford.github.io/pytorch-getting-started.html 350 | [pt-vision]: https://cs230-stanford.github.io/pytorch-vision.html 351 | [pt-nlp]: https://cs230-stanford.github.io/pytorch-nlp.html 352 | [tf-data]: https://cs230-stanford.github.io/tensorflow-input-data.html 353 | [tf-model]: https://cs230-stanford.github.io/tensorflow-model.html 354 | 355 | 356 | [SIGNS]: https://drive.google.com/file/d/1ufiR6hUKhXoAyiBNsySPkUwlvE_wfEHC/view?usp=sharing 357 | [matrix]: https://youtu.be/zE7PKRjrid4?t=1m26s 358 | -------------------------------------------------------------------------------- /assets/my-title/img2latex_task.svg: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /_posts/2018-02-01-tensorflow-model.md: -------------------------------------------------------------------------------- 1 | --- 2 | layout: post 3 | title: "Create and train a Model" 4 | description: "Create and train a Model in Tensorflow using tf.layers, tf.train, tf.metrics, Tensorboard" 5 | excerpt: "Using tf.layers, tf.train, tf.metrics, Tensorboard" 6 | author: "Teaching assistants Guillaume Genthial and Olivier Moindrot" 7 | date: 2018-01-30 8 | mathjax: true 9 | published: true 10 | tags: tensorflow 11 | github: https://github.com/cs230-stanford/cs230-code-examples/tree/master/tensorflow 12 | module: Tutorials 13 | --- 14 | 15 | If you haven't read the previous post, 16 | 17 |

> [Building the data pipeline][post-3]
    20 | 21 | 22 | This post is part of a series of post explaining how to structure a deep learning project in TensorFlow. 23 | We will explain here how to easily define a deep learning model in TensorFlow using `tf.layers`, and how to train it. 24 | The entire code examples can be found in our [github repository][github]. 25 | 26 | 27 | This tutorial is among a series explaining how to structure a deep learning project: 28 | 29 | - [first post][post-1]: installation, get started with the code for the projects 30 | - [second post][post-2] (TensorFlow only): explain the global structure of the code 31 | - [third post][post-3] (TensorFlow only): how to feed data into the model using `tf.data` 32 | - __this post: how to create the model and train it__ 33 | 34 | __Goals of this tutorial__ 35 | - learn more about TensorFlow 36 | - learn how to easily build models using `tf.layers` 37 | 38 | - ... 39 | 40 | __Table of Content__ 41 | 42 | * TOC 43 | {:toc} 44 | 45 | --- 46 | 47 | ## Defining the model 48 | 49 | Great, now we have this `input` dictionnary containing the Tensor corresponding to the data, let's explain how we build the model. 50 | 51 | 52 | ### Introduction to tf.layers 53 | 54 | This high-level Tensorflow API lets you build and prototype models in a few lines. You can have a look at the [official tutorial for computer vision](https://www.tensorflow.org/tutorials/layers), or at the [list of available layers](https://www.tensorflow.org/api_docs/python/tf/layers). The idea is quite simple so we'll just give an example. 55 | 56 | 57 | Let's get an input Tensor with a similar mechanism than the one explained in the previous part. Remember that __None__ corresponds to the batch dimension. 58 | 59 | ```python 60 | # shape = [None, 64, 64, 3] 61 | images = inputs["images"] 62 | ``` 63 | 64 | Now, let's apply a convolution, a relu activation and a max-pooling. This is as simple as 65 | 66 | ```python 67 | out = images 68 | out = tf.layers.conv2d(out, 16, 3, padding='same') 69 | out = tf.nn.relu(out) 70 | out = tf.layers.max_pooling2d(out, 2, 2) 71 | ``` 72 | 73 | Finally, use this final tensor to predict the labels of the image (6 classes). We first need to reshape the output of the max-pooling to a vector 74 | 75 | ```python 76 | # First, reshape the output into [batch_size, flat_size] 77 | out = tf.reshape(out, [-1, 32 * 32 * 16]) 78 | # Now, logits is [batch_size, 6] 79 | logits = tf.layers.dense(out, 6) 80 | ``` 81 | > Note the use of `-1`: Tensorflow will compute the corresponding dimension so that the total size is preserved. 82 | 83 | The logits will be *unnormalized* scores for each example. 84 | 85 | > In the code examples, the transformation from `inputs` to `logits` is done in the `build_model` function. 86 | 87 | ### Training ops 88 | 89 | 90 | At this point, we have defined the `logits` of the model. We need to define our predictions, our loss, etc. You can have a look at the `model_fn` in `model/model_fn.py`. 
91 | 92 | 93 | ```python 94 | # Get the labels from the input data pipeline 95 | labels = inputs['labels'] 96 | labels = tf.cast(labels, tf.int64) 97 | 98 | # Define the prediction as the argmax of the scores 99 | predictions = tf.argmax(logits, 1) 100 | 101 | # Define the loss 102 | loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits) 103 | ``` 104 | >The `1` in `tf.argmax` tells Tensorflow to take the argmax on the axis = 1 (remember that axis = 0 is the batch dimension) 105 | 106 | Now, let's use Tensorflow built-in functions to create nodes and operators that will train our model at each iteration ! 107 | 108 | ```python 109 | # Create an optimizer that will take care of the Gradient Descent 110 | optimizer = tf.train.AdamOptimizer(0.01) 111 | 112 | # Create the training operation 113 | train_op = optimizer.minimize(loss) 114 | ``` 115 | > All these nodes are created by `model_fn` that returns a dictionnary `model_spec` containing all the necessary nodes and operators of the graph. This dictionnary will later be used for actually running the training operations etc. 116 | 117 | 118 | And that's all ! Our model is ready to be trained. Remember that all the objects we defined so far are nodes or operators that are part of the Tensorflow graph. To evaluate them, we actually need to execute them in a session. Simply run 119 | 120 | ```python 121 | with tf.Session() as sess: 122 | for i in range(num_batches): 123 | _, loss_val = sess.run([train_op, loss]) 124 | ``` 125 | > Notice how we don't need to feed data to the session as the `tf.data` nodes automatically iterate over the dataset ! 126 | At every iteration of the loop, it will move to the next batch (remember the `tf.data` part), compute the loss, and execute the `train_op` that will perform one update of the weights ! 127 | 128 | 129 | For more details, have a look at the `model/training.py` file that defines the `train_and_evaluate` function. 130 | 131 | 132 | ### Putting input_fn and model_fn together 133 | 134 | 135 | To summarize the different steps, we just give a high-level overview of what needs to be done in `train.py` 136 | 137 | ```python 138 | # 1. Create the iterators over the Training and Evaluation datasets 139 | train_inputs = input_fn(True, train_filenames, train_labels, params) 140 | eval_inputs = input_fn(False, eval_filenames, eval_labels, params) 141 | 142 | # 2. Define the model 143 | logging.info("Creating the model...") 144 | train_model_spec = model_fn('train', train_inputs, params) 145 | eval_model_spec = model_fn('eval', eval_inputs, params, reuse=True) 146 | 147 | # 3. Train the model (where a session will actually run the different ops) 148 | logging.info("Starting training for {} epoch(s)".format(params.num_epochs)) 149 | train_and_evaluate(train_model_spec, eval_model_spec, args.model_dir, params, args.restore_from) 150 | ``` 151 | 152 | The `train_and_evaluate` function performs a given number of epochs (= full pass on the `train_inputs`). At the end of each epoch, it evaluates the performance on the development set (`dev` or `train-dev` in the course material). 153 | 154 | > Remember the discussion about different graphs for Training and Evaluation. Here, notice how the `eval_model_spec` is given the `reuse=True` argument. It will make sure that the nodes of the Evaluation graph which must share weights with the Training graph __do__ share their weights. 
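To make the `model_spec` idea more concrete, here is a rough sketch of what such a `model_fn` could look like, reusing the `build_model` function mentioned above (whose exact signature may differ) and assuming the hyperparameters object exposes a `learning_rate` attribute. Treat it as an outline rather than the exact implementation: the actual `model/model_fn.py` also adds the metrics, summaries and initializers discussed below.

```python
def model_fn(mode, inputs, params, reuse=False):
    """Simplified sketch: build the graph and collect the useful nodes in a dict."""
    is_training = (mode == 'train')
    labels = tf.cast(inputs['labels'], tf.int64)

    # From inputs to logits (conv / relu / pool / dense, as above), inside a variable
    # scope so that the evaluation graph can reuse the training weights (cf. reuse=True)
    with tf.variable_scope('model', reuse=reuse):
        logits = build_model(is_training, inputs, params)

    predictions = tf.argmax(logits, 1)
    loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)

    # Gather everything the training / evaluation loops will need
    model_spec = dict(inputs)            # keeps 'images', 'labels', 'iterator_init_op', ...
    model_spec['predictions'] = predictions
    model_spec['loss'] = loss

    if is_training:
        optimizer = tf.train.AdamOptimizer(params.learning_rate)
        model_spec['train_op'] = optimizer.minimize(loss)

    return model_spec
```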
### Evaluation and tf.metrics

[Tensorflow doc](https://www.tensorflow.org/api_docs/python/tf/metrics)

So far, we've explained how we input data to the graph and how we define the different nodes and training ops, but we don't know (yet) how to compute some metrics on our dataset. There are basically 2 possibilities:

1. __[run evaluation outside the Tensorflow graph]__ Evaluate the prediction over the dataset by running `sess.run(prediction)` and use it to evaluate your model (without Tensorflow, with pure python code). This option can also be used if you need to write a file with all the predictions and use a script (distributed by a conference for instance) to evaluate the performance of your model.
2. __[use Tensorflow]__ As the above method can be quite complicated for simple metrics, Tensorflow luckily has some built-in tools to run evaluation. Again, we are going to create nodes and operations in the Graph. The concept is simple: we use the `tf.metrics` API to build those, the idea being that we need to update the metric on each batch. At the end of the epoch, we can just query the updated metric !

We'll cover method 2 as this is the one we implemented in the code examples (but you can definitely go with option 1 by modifying `model/evaluation.py`). Like most of the nodes of the graph, we define these *metrics* nodes and ops in `model/model_fn.py`.

```python
# Define the different metrics
with tf.variable_scope("metrics"):
    metrics = {'accuracy': tf.metrics.accuracy(labels=labels, predictions=predictions),
               'loss': tf.metrics.mean(loss)}

# Group the update ops for the tf.metrics, so that we can run only one op to update them all
update_metrics_op = tf.group(*[op for _, op in metrics.values()])

# Get the op to reset the local variables used in tf.metrics, for when we restart an epoch
metric_variables = tf.get_collection(tf.GraphKeys.LOCAL_VARIABLES, scope="metrics")
metrics_init_op = tf.variables_initializer(metric_variables)
```
> Notice that we define the metrics, a *grouped* update op and an initializer. The `*` in [`tf.group`](https://www.tensorflow.org/api_docs/python/tf/group) is Python's unpacking operator: it passes the elements of the list as separate positional arguments to the function.

> Notice also how we define the metrics in a special `variable_scope` so that we can query the variables by name when we create the initializer ! When you create nodes, the variables are added to some pre-defined collections of variables (TRAINABLE_VARIABLES, etc.). The variables we need to reset for `tf.metrics` are in the [`tf.GraphKeys.LOCAL_VARIABLES`](https://www.tensorflow.org/api_docs/python/tf/GraphKeys) collection. Thus, to query the variables, we get the collection of variables in the right scope !

Now, to evaluate the metrics on a dataset, we just need to run them in a session as we loop over our dataset:

```python
with tf.Session() as sess:
    # Run the initializer to reset the metrics to zero
    sess.run(metrics_init_op)

    # Update the metrics over the dataset
    for _ in range(num_steps):
        sess.run(update_metrics_op)

    # Get the values of the metrics
    metrics_values = {k: v[0] for k, v in metrics.items()}
    metrics_val = sess.run(metrics_values)
```

And that's all !
If you want to compute new metrics for which you can find a [Tensorflow implementation](https://www.tensorflow.org/api_docs/python/tf/metrics), you can define them in `model_fn.py` (add them to the `metrics` dictionary). They will automatically be updated during training and displayed at the end of each epoch.

---

## Tensorflow Tips and Tricks

### Be careful with initialization

So far, we mentioned 3 different *initializer* operators.

```python
# 1. For all the variables (the weights etc.)
tf.global_variables_initializer()

# 2. For the dataset, so that we can choose to move the iterator back to the beginning
iterator = dataset.make_initializable_iterator()
next_element = iterator.get_next()
iterator_init_op = iterator.initializer

# 3. For the metrics variables, so that we can reset them to 0 at the beginning of each epoch
metrics_init_op = tf.variables_initializer(metric_variables)
```

During `train_and_evaluate` we perform the following schedule, all in one session:

1. Loop over the training set, updating the weights and computing the metrics
2. Loop over the evaluation set, computing the metrics
3. Go back to step 1.

We thus need to run
- `tf.global_variables_initializer()` at the very beginning (before the first occurrence of step 1)
- `iterator_init_op` at the beginning of every loop (step 1 and step 2)
- `metrics_init_op` at the beginning of every loop (step 1 and step 2), to reset the metrics to zero (we don't want to compute the metrics averaged over different epochs or different datasets !)

You can check that this is indeed what we do in `model/evaluation.py` or `model/training.py` when we actually run the graph !

### Saving

[Official guide](https://www.tensorflow.org/programmers_guide/saved_model)

Training and evaluating a model is fine, but what about re-using the weights? Also, maybe at some point during training the performance started to get worse on the validation set, and we want to use the best weights we got during training.

Saving models is easy in Tensorflow. Look at the outline below:

```python
# We need to create an instance of saver
saver = tf.train.Saver()

for epoch in range(10):
    for batch in range(10):
        _ = sess.run(train_op)

    # Save weights at the end of each epoch
    save_path = os.path.join(model_dir, 'last_weights', 'after-epoch')
    saver.save(sess, save_path, global_step=epoch + 1)
```

There is not much to say, except that the `saver.save()` method takes a session as input. In our implementation, we use 2 savers: a `last_saver = tf.train.Saver()` that keeps the weights at the end of the last 5 epochs and a `best_saver = tf.train.Saver(max_to_keep=1)` that only keeps one checkpoint corresponding to the weights that achieved the best performance on the validation set !
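To illustrate how the two savers can work together, here is a condensed sketch of the epoch loop. The names `sess`, `model_dir`, `num_epochs` and `eval_acc` are placeholders; the actual bookkeeping lives in the code examples.

```python
import os

last_saver = tf.train.Saver()               # keeps the 5 most recent checkpoints by default
best_saver = tf.train.Saver(max_to_keep=1)  # keeps a single best checkpoint

best_eval_acc = 0.0
for epoch in range(num_epochs):
    # ... train for one epoch, then evaluate on the dev set to obtain eval_acc ...

    # Always save the latest weights
    last_saver.save(sess, os.path.join(model_dir, 'last_weights', 'after-epoch'),
                    global_step=epoch + 1)

    # Overwrite the best weights whenever the dev metric improves
    if eval_acc > best_eval_acc:
        best_eval_acc = eval_acc
        best_saver.save(sess, os.path.join(model_dir, 'best_weights', 'after-epoch'),
                        global_step=epoch + 1)
```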
260 | 261 | 262 | Later on, to restore the weights of your model, you need to reload the weights thanks to a saver instance, as in 263 | 264 | ```python 265 | with tf.Session() as sess: 266 | # Get the latest checkpoint in the directory 267 | restore_from = tf.train.latest_checkpoint("model/last_weights") 268 | # Reload the weights into the variables of the graph 269 | saver.restore(sess, restore_from) 270 | ``` 271 | 272 | > You can look at the files `model/training.py` and `model/evaluation.py` for more details. 273 | 274 | ### Tensorboard and summaries 275 | 276 | [Official guide](https://www.tensorflow.org/programmers_guide/summaries_and_tensorboard) 277 | 278 | Tensorflow comes with an excellent visualization tool called __Tensorboard__ that enables you to plot different scalars (and much more) in real-time, as you train your model. 279 | 280 | {% include image.html url="/assets/tensorflow-model/tensorboard.png" description="Tensorboard overview" size="80%" %} 281 | 282 | The mechanism of Tensorboard is the following 283 | 1. define some *summaries* (nodes of the graph) that will tell Tensorflow which values we want to plot 284 | 2. evaluate these nodes in the `session` 285 | 3. write the output to a file thanks to a `tf.summary.FileWriter` 286 | 287 | Then, you only need to launch tensorboard in your web-browser by opening a terminal and writing for instance 288 | ``` 289 | tensorboard --logdir="expirements/base_model" 290 | ``` 291 | 292 | Then, navigate to http://127.0.0.1:6006/ and you'll see the different plots. 293 | 294 | In the code examples, we add the summaries in `model/model_fn.py` 295 | 296 | ```python 297 | # Compute different scalars to plot 298 | loss = tf.reduce_mean(losses) 299 | accuracy = tf.reduce_mean(tf.cast(tf.equal(labels, predictions), tf.float32)) 300 | 301 | # Summaries for training 302 | tf.summary.scalar('loss', loss) 303 | tf.summary.scalar('accuracy', accuracy) 304 | ``` 305 | > Note that we don't use the metrics that we defined earlier. The reason being that the `tf.metrics` returns the running average, but Tensorboard already takes care of the smoothing, so we don't want to add any additional smoothing. It's actually rather the opposite: we are interested in real-time progress 306 | 307 | Once these nodes are added to the `model_spec` dictionnary, we need to evaluate them in a session. In our implementation, this is done every `params.save_summary_steps` as you'll notice in the `model/training.py` file. 308 | 309 | 310 | ```python 311 | if i % params.save_summary_steps == 0: 312 | # Perform a mini-batch update 313 | _, _, loss_val, summ, global_step_val = sess.run([train_op, update_metrics, loss, summary_op, global_step]) 314 | # Write summaries for tensorboard 315 | writer.add_summary(summ, global_step_val) 316 | 317 | else: 318 | _, _, loss_val = sess.run([train_op, update_metrics, loss]) 319 | ``` 320 | 321 | You'll notice that we have 2 different writers 322 | 323 | ```python 324 | train_writer = tf.summary.FileWriter(os.path.join(model_dir, 'train_summaries'), sess.graph) 325 | eval_writer = tf.summary.FileWriter(os.path.join(model_dir, 'eval_summaries'), sess.graph) 326 | ``` 327 | 328 | They'll write summaries for both the training and the evaluation, letting you plot both plots on the same graph ! 
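One piece the training-loop snippet above takes for granted is the `summary_op` itself. A common way to obtain it, once the individual `tf.summary.scalar` calls have been made, is to merge all summaries into a single op; this is an illustrative one-liner, not necessarily verbatim from the code examples.

```python
# Merge all the summaries defined so far (loss, accuracy, ...) into a single op
summary_op = tf.summary.merge_all()
```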
329 | 330 | 331 | ### A note about the global_step 332 | 333 | [Official doc](https://www.tensorflow.org/api_docs/python/tf/train/Optimizer#minimize) 334 | 335 | In order to keep track of how far we are in the training, we use one of Tensorflow's training utilities, the `global_step`. Once initialized, we give it to the `optimizer.minimize()` as explained below. Thus, each time we will run `sess.run(train_op)`, it will increment the `global_step` by 1. This is very useful for summaries (notice how in the Tensorboard part we give the global step to the `writer`). 336 | 337 | ```python 338 | global_step = tf.train.get_or_create_global_step() 339 | train_op = optimizer.minimize(loss, global_step=global_step) 340 | ``` 341 | 342 | 343 | 344 | 345 | [github]: https://github.com/cs230-stanford/cs230-code-examples 346 | [post-1]: https://cs230-stanford.github.io/project-code-examples.html 347 | [post-2]: https://cs230-stanford.github.io/tensorflow-getting-started.html 348 | [post-3]: https://cs230-stanford.github.io/tensorflow-input-data.html 349 | -------------------------------------------------------------------------------- /assets/my-title/seq2seq_vanilla_encoder.svg: -------------------------------------------------------------------------------- 1 | -------------------------------------------------------------------------------- /_posts/2018-02-01-tensorflow-input-data.md: -------------------------------------------------------------------------------- 1 | --- 2 | layout: post 3 | title: "Building a data pipeline" 4 | description: "Tutorial explaining how to use Tensorflow tf.data for text and images" 5 | excerpt: "Using Tensorflow tf.data for text and images" 6 | author: "Teaching assistants Olivier Moindrot and Guillaume Genthial" 7 | date: 2018-01-24 8 | mathjax: true 9 | published: true 10 | tags: tensorflow tf.data 11 | github: https://github.com/cs230-stanford/cs230-code-examples/tree/master/tensorflow 12 | module: Tutorials 13 | --- 14 | If you haven't read the previous post, 15 | 16 |

> [Introduction to Tensorflow][post-2]
    19 | 20 | 21 | __Motivation__ 22 | 23 | Building the input pipeline in a machine learning project is always long and painful, and can take more time than building the actual model. 24 | In this tutorial we will learn how to use TensorFlow's Dataset module `tf.data` to build efficient pipelines for images and text. 25 | 26 | 27 | 28 | This tutorial is among a series explaining how to structure a deep learning project: 29 | 30 | - [first post][post-1]: installation, get started with the code for the projects 31 | - [second post][post-2]: (TensorFlow) explain the global structure of the code 32 | - __this post: (TensorFlow) how to build the data pipeline__ 33 | - [fourth post][post-4]: (Tensorflow) how to build the model and train it 34 | 35 | __Goals of this tutorial__ 36 | - learn how to use `tf.data` and the best practices 37 | - build an efficient pipeline for loading images and preprocessing them 38 | - build an efficient pipeline for text, including how to build a vocabulary 39 | 40 | __Table of contents__ 41 | 42 | * TOC 43 | {:toc} 44 | 45 | 46 | 47 | --- 48 | 49 | ## An overview of tf.data 50 | 51 | The `Dataset` API allows you to build an asynchronous, highly optimized data pipeline to prevent your GPU from [data starvation](https://www.tensorflow.org/performance/performance_guide#input_pipeline_optimization). 52 | It loads data from the disk (images or text), applies optimized transformations, creates batches and sends it to the GPU. Former data pipelines made the GPU wait for the CPU to load the data, leading to performance issues. 53 | 54 | 55 | Before explaining how `tf.data` works with a simple example, we'll share some great official resources: 56 | - [API docs][api-tf-data] for `tf.data` 57 | - [API docs][api-tf-contrib-data] for `tf.contrib.data`: new features still in beta mode. Contains useful functions that will soon be added to the main `tf.data` 58 | - [Datasets Quick Start][quick-start-tf-data]: gentle introduction to `tf.data` 59 | - [Programmer's guide][programmer-guide-tf-data]: more advanced and detailed guide to the best practices when using Datasets in TensorFlow 60 | - [Performance guide][performance-guide]: advanced guide to improve performance of the data pipeline 61 | - [Official blog post][blog-post-tf-data] introducing Datasets and Estimators. We don't use Estimators in our [code examples][github] so you can safely ignore them for now. 62 | - [Slides from the creator of tf.data][slides] explaining the API, best practices (don't forget to read the speaker notes below the slides) 63 | - [Origin github issue][github-issue-tf-data] for Datasets: a bit of history on the origin of `tf.data` 64 | - [Stackoverflow][stackoverflow] tag for the Datasets API 65 | 66 | ### Introduction to tf.data with a Text Example 67 | 68 | 69 | Let's go over a quick example. Let's say we have a `file.txt` file containing sentences 70 | 71 | ``` 72 | I use Tensorflow 73 | You use PyTorch 74 | Both are great 75 | ``` 76 | 77 | Let's read this file with the `tf.data` API: 78 | 79 | ```python 80 | dataset = tf.data.TextLineDataset("file.txt") 81 | ``` 82 | 83 | Let's try to iterate over it 84 | 85 | ```python 86 | for line in dataset: 87 | print(line) 88 | ``` 89 | 90 | We get an error 91 | ``` 92 | > TypeError: 'TextLineDataset' object is not iterable 93 | ``` 94 | 95 | > Wait... What just happened ? I thought it was supposed to read the data. 
96 | 97 | ### Iterators and transformations 98 | 99 | What's really happening is that `dataset` is a node of the Tensorflow `Graph` that contains instructions to read the file. We need to initialize the graph and evaluate this node in a Session if we want to read it. While this may sound awfully complicated, this is quite the oposite : now, even the dataset object is a part of the graph, so you don't need to worry about how to feed the data into your model ! 100 | 101 | We need to add a few things to make it work. First, let's create an `iterator` object over the dataset 102 | 103 | ``` 104 | iterator = dataset.make_one_shot_iterator() 105 | next_element = iterator.get_next() 106 | ``` 107 | > The `one_shot_iterator` method creates an iterator that will be able to iterate once over the dataset. In other words, once we reach the end of the dataset, it will stop yielding elements and raise an Exception. 108 | 109 | Now, `next_element` is a graph's node that will contain the next element of iterator over the Dataset at each execution. Now, let's run it 110 | 111 | ```python 112 | with tf.Session() as sess: 113 | for i in range(3): 114 | print(sess.run(next_element)) 115 | 116 | >'I use Tensorflow' 117 | >'You use PyTorch' 118 | >'Both are great' 119 | ``` 120 | 121 | 122 | Now that you understand the idea behind the `tf.data` API, let's quickly review some more advanced tricks. First, you can easily apply transformations to your dataset. For instance, splitting words by space is as easy as adding one line 123 | ```python 124 | dataset = dataset.map(lambda string: tf.string_split([string]).values) 125 | ``` 126 | 127 | Shuffling the dataset is also straightforward 128 | 129 | ```python 130 | dataset = dataset.shuffle(buffer_size=3) 131 | ``` 132 | 133 | It will load elements 3 by 3 and shuffle them at each iteration. 134 | 135 | You can also create batches 136 | 137 | ``` 138 | dataset = dataset.batch(2) 139 | ``` 140 | 141 | and pre-fetch the data (in other words, it will always have one batch ready to be loaded). 142 | 143 | ``` 144 | dataset = dataset.prefetch(1) 145 | ``` 146 | 147 | Now, let's see what our iterator has become 148 | 149 | ```python 150 | iterator = dataset.make_one_shot_iterator() 151 | next_element = iterator.get_next() 152 | with tf.Session() as sess: 153 | print(sess.run(next_element)) 154 | 155 | >[['Both' 'are' 'great'] 156 | ['You' 'use' 'PyTorch']] 157 | ``` 158 | 159 | and as you can see, we now have a batch created from the shuffled Dataset ! 160 | 161 | __All the nodes in the Graph are assumed to be batched: every Tensor will have `shape = [None, ...]` where None corresponds to the (unspecified) batch dimension__ 162 | 163 | ### Why we use initializable iterators 164 | 165 | As you'll see in the `input_fn.py` files, we decided to use an initializable iterator. 166 | 167 | ```python 168 | dataset = tf.data.TextLineDataset("file.txt") 169 | iterator = dataset.make_initializable_iterator() 170 | next_element = iterator.get_next() 171 | init_op = iterator.initializer 172 | ``` 173 | 174 | Its behavior is similar to the one above, but thanks to the `init_op` we can chose to "restart" from the beginning. This will become quite handy when we want to perform multiple epochs ! 
175 | 176 | ```python 177 | with tf.Session() as sess: 178 | # Initialize the iterator 179 | sess.run(init_op) 180 | print(sess.run(next_element)) 181 | print(sess.run(next_element)) 182 | # Move the iterator back to the beginning 183 | sess.run(init_op) 184 | print(sess.run(next_element)) 185 | 186 | > 'I use Tensorflow' 187 | 'You use PyTorch' 188 | 'I use Tensorflow' # Iterator moved back at the beginning 189 | ``` 190 | 191 | > As we use only one session over the different epochs, we need to be able to restart the iterator. Some other approaches (like `tf.Estimator`) alleviate the need of using `initializable` iterators by creating a new session at each epoch. But this comes at a cost: the weights and the graph must be re-loaded and re-initialized with each call to `estimator.train()` or `estimator.evaluate()`. 192 | 193 | 194 | ### Where do I find the data pipeline in the code examples ? 195 | 196 | The `model/input_fn.py` defines a function `input_fn` that returns a dictionnary that looks like 197 | 198 | ```python 199 | images, labels = iterator.get_next() 200 | iterator_init_op = iterator.initializer 201 | 202 | inputs = {'images': images, 'labels': labels, 'iterator_init_op': iterator_init_op} 203 | ``` 204 | 205 | This dictionay of inputs will be passed to the model function, which we will detail in the [next post][post-4]. 206 | 207 | 208 | ## Building an image data pipeline 209 | 210 | 211 | Here is what a Dataset for images might look like. Here we already have a list of `filenames` to jpeg images and a corresponding list of `labels`. We apply the following steps for training: 212 | 213 | 1. Create the dataset from slices of the filenames and labels 214 | 2. Shuffle the data with a buffer size equal to the length of the dataset. This ensures good shuffling (cf. [this answer][stackoverflow-buffer-size]) 215 | 3. Parse the images from filename to the pixel values. Use multiple threads to improve the speed of preprocessing 216 | 4. (Optional for training) Data augmentation for the images. Use multiple threads to improve the speed of preprocessing 217 | 5. Batch the images 218 | 6. 
Prefetch one batch to make sure that a batch is ready to be served at all times

```python
dataset = tf.data.Dataset.from_tensor_slices((filenames, labels))
dataset = dataset.shuffle(len(filenames))
dataset = dataset.map(parse_function, num_parallel_calls=4)
dataset = dataset.map(train_preprocess, num_parallel_calls=4)
dataset = dataset.batch(batch_size)
dataset = dataset.prefetch(1)
```

The `parse_function` will do the following:
- read the content of the file
- decode using jpeg format
- convert to float values in `[0, 1]`
- resize to size `(64, 64)`

```python
def parse_function(filename, label):
    image_string = tf.read_file(filename)

    # Don't use tf.image.decode_image, or the output shape will be undefined
    image = tf.image.decode_jpeg(image_string, channels=3)

    # This will convert to float values in [0, 1]
    image = tf.image.convert_image_dtype(image, tf.float32)

    image = tf.image.resize_images(image, [64, 64])
    return image, label
```

And finally the `train_preprocess` can optionally be used during training to perform data augmentation:
- Horizontally flip the image with probability 1/2
- Apply random brightness and saturation

```python
def train_preprocess(image, label):
    image = tf.image.random_flip_left_right(image)

    image = tf.image.random_brightness(image, max_delta=32.0 / 255.0)
    image = tf.image.random_saturation(image, lower=0.5, upper=1.5)

    # Make sure the image is still in [0, 1]
    image = tf.clip_by_value(image, 0.0, 1.0)

    return image, label
```

## Building a text data pipeline

Have a look at the Tensorflow seq2seq tutorial using the tf.data pipeline
- [documentation](https://www.tensorflow.org/tutorials/seq2seq)
- [github](https://github.com/tensorflow/nmt/)

### Files format

We've covered a simple example in the __Overview of tf.data__ section. Now, let's cover a more advanced example. Let's assume that our task is [Named Entity Recognition](https://en.wikipedia.org/wiki/Named-entity_recognition). In other words, our input is a sentence and our output is a label for each word, like in

```
John lives in New York
B-PER O O B-LOC I-LOC
```

Our dataset will thus need to load both the sentences and the labels. We will store those in 2 different files, a `sentences.txt` file containing the sentences (one per line) and a `labels.txt` file containing the labels. For example

```
# sentences.txt
John lives in New York
Where is John ?
```

```
# labels.txt
B-PER O O B-LOC I-LOC
O O B-PER O
```

Constructing `tf.data` objects that iterate over these files is easy

```python
# Load txt file, one example per line
sentences = tf.data.TextLineDataset("sentences.txt")
labels = tf.data.TextLineDataset("labels.txt")
```

### Zip datasets together

At this stage, we might want to iterate over these 2 files *at the same time*. This operation is usually known as a *"zip"*.
Luckily, `tf.data` comes with such a function

```python
# Zip the sentences and the labels together
dataset = tf.data.Dataset.zip((sentences, labels))

# Create a one-shot iterator over the zipped dataset
iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()

# Actually run in a session
with tf.Session() as sess:
    for i in range(2):
        print(sess.run(next_element))

> ('John lives in New York', 'B-PER O O B-LOC I-LOC')
('Where is John ?', 'O O B-PER O')
```

### Creating the vocabulary

Great, now we can get the sentence and the labels as we iterate. Let's see how we can transform this string into a sequence of words, and then into a sequence of ids.
> Most NLP systems rely on ids as input for the words, meaning that you'll eventually have to convert your sentence into a sequence of ids.

Here we assume that we ran some script, like `build_vocab.py`, that created some vocabulary files in our `/data` directory. We'll need one file for the words and one file for the labels. They will contain one token per line. For instance

```
# words.txt
John
lives
in
...
```

and

```
# tags.txt
B-PER
B-LOC
...
```

Tensorflow has a cool built-in tool to take care of the mapping. We simply define 2 lookup tables

```python
words = tf.contrib.lookup.index_table_from_file("data/words.txt", num_oov_buckets=1)
tags = tf.contrib.lookup.index_table_from_file("data/tags.txt")
```
> The parameter `num_oov_buckets` specifies the number of buckets created for unknown words. The id will be determined by Tensorflow and we don't have to worry about it. As in most cases we just want one id reserved for all out-of-vocabulary words, we use `num_oov_buckets=1`.

Now that we have initialized this lookup table, we are going to transform the way we read the files, by adding these extra lines

```python
# Convert line into list of tokens, splitting by white space
sentences = sentences.map(lambda string: tf.string_split([string]).values)

# Lookup tokens to return their ids
sentences = sentences.map(lambda tokens: (words.lookup(tokens), tf.size(tokens)))
```
> Be careful that `tf.string_split` returns a `tf.SparseTensor`, that's why we need to extract the `values`.

### Creating padded batches

Great! Now we can iterate and get a list of ids of words and labels for each sentence. We just need to take care of one final thing: __batches__! But here comes a problem: *sentences have different lengths*. Thus, we need to perform an extra __padding__ operation that adds a special padding token to shorter sentences, so that our final batch Tensor has shape `[batch_size, max_len_of_sentence_in_the_batch]`.
379 | 380 | We first need to specify the padding shapes and values 381 | 382 | ```python 383 | # Create batches and pad the sentences of different length 384 | padded_shapes = (tf.TensorShape([None]), # sentence of unknown size 385 | tf.TensorShape([None])) # labels of unknown size 386 | 387 | padding_values = (params.id_pad_word, # sentence padded on the right with id_pad_word 388 | params.id_pad_tag) # labels padded on the right with id_pad_tag 389 | ``` 390 | > Note that the padding_values must be in the vocabulary (otherwise we might have a problem later on). That's why we get the id of the special "\" token in `train.py` with `id_pad_word = words.lookup(tf.constant(''))`. 391 | 392 | 393 | Then, we can just use the `tf.data` `padded_batch` method, that takes care of the padding ! 394 | 395 | ```python 396 | # Shuffle the dataset and then create the padded batches 397 | dataset = (dataset 398 | .shuffle(buffer_size=buffer_size) 399 | .padded_batch(32, padded_shapes=padded_shapes, padding_values=padding_values) 400 | ) 401 | ``` 402 | 403 | ### Computing the sentence's size 404 | 405 | Is that all that we need in general ? Not quite. As we mentionned padding, we have to make sure that our model does not take the extra padded-tokens into account when computing its prediction. A common way of solving this issue is to add extra information to our data iterator and give the length of the input sentence as input. Later on, we will be able to give this argument to the `dynamic_rnn` function or create binary masks with `tf.sequence_mask`. 406 | 407 | Look at the `model/input_fn.py` file for more details. But basically, it boils down to adding one line, using `tf.size` 408 | 409 | ```python 410 | sentences = sentences.map(lambda tokens: (vocab.lookup(tokens), tf.size(tokens))) 411 | ``` 412 | 413 | 414 | ### Advanced use - extracting characters 415 | 416 | Now, let's try to perform a more complicated operation. We want to extract characters from each word, maybe because our NLP system relies on characters. Our input is a file that looks like 417 | 418 | ``` 419 | 1 22 420 | 3333 4 55 421 | ``` 422 | 423 | We first create a dataset that yields the words for each sentence, as usual 424 | 425 | ```python 426 | dataset = tf.data.TextLineDataset("file.txt") 427 | dataset = dataset.map(lambda token: tf.string_split([token]).values) 428 | ``` 429 | 430 | Now, we are going to reuse the `tf.string_split` function. However, it outputs a sparse tensor, a convenient data representation in general but which doesn't seem do be supported (yet) by `tf.data`. Thus, we need to convert this `SparseTensor` to a regular `Tensor` 431 | 432 | ```python 433 | def extract_char(token, default_value=""): 434 | # Split characters 435 | out = tf.string_split(token, delimiter='') 436 | # Convert to Dense tensor, filling with default value 437 | out = tf.sparse_tensor_to_dense(out, default_value=default_value) 438 | return out 439 | 440 | # Dataset yields word and characters 441 | dataset = dataset.map(lambda token: (token, extract_char(token))) 442 | ``` 443 | > Notice how we specified a `default_value` to the `tf.sparse_tensor_to_dense` function: words have different lengths, thus the `SparseTensor` that we need to convert has some *unspecified* entries ! 
444 | 445 | Creating the padded batches is still as easy as above 446 | 447 | ```python 448 | # Creating the padded batch 449 | padded_shapes = (tf.TensorShape([None]), # padding the words 450 | tf.TensorShape([None, None])) # padding the characters for each word 451 | padding_values = ('', # sentences padded on the right with 452 | '') # arrays of characters padded on the right with 453 | 454 | dataset = dataset.padded_batch(2, padded_shapes=padded_shapes, padding_values=padding_values) 455 | ``` 456 | 457 | and you can test that the output matches your expectations 458 | 459 | ```python 460 | iterator = dataset.make_one_shot_iterator() 461 | next_element = iterator.get_next() 462 | 463 | with tf.Session() as sess: 464 | for i in range(1): 465 | sentences, characters = sess.run(next_element)) 466 | print(sentences[0]) 467 | print(characters[0][1]) 468 | 469 | > ['1', '22', ''] # sentence 1 (words) 470 | ['2', '2', '', ''] # sentence 1 word 2 (chars) 471 | ``` 472 | > Can you explain why we have 2 `` and 1 `` in the first batch ? 473 | 474 | 475 | --- 476 | 477 | ## Best Practices 478 | 479 | One general tip mentioned in [the performance guide][performance-guide] is to put all the data processing pipeline on the CPU to make sure that the GPU is only used for training the deep neural network model: 480 | 481 | ```python 482 | with tf.device('/cpu:0'): 483 | dataset = ... 484 | ``` 485 | 486 | ### Shuffle and repeat 487 | 488 | When training on a dataset, we often need to repeat it for multiple epochs and we need to shuffle it. 489 | 490 | One big caveat when shuffling is to make sure that the `buffer_size` argument is big enough. 491 | The bigger it is, the longer it is going to take to load the data at the beginning. 492 | However a low buffer size can be disastrous for training. Here is a good [answer][stackoverflow-buffer-size] on stackoverflow detailing an example of why. 493 | 494 | The best way to avoid this kind of error might be to split the dataset into train / dev / test in advance and already shuffle the data there (see our other [post][train-dev-test]). 495 | 496 |
497 | In general, it is good to have the shuffling and repeat at the beginning of the pipeline. For instance, if the input to the dataset is a list of filenames and we shuffle right after creating it, the buffer of `tf.data.Dataset.shuffle()` only has to hold filenames, which is very light on memory.
498 | 
499 | When choosing the ordering between shuffle and repeat, you may consider two options:
500 |   - __shuffle then repeat__: we shuffle the dataset in a certain way, and repeat this shuffling for multiple epochs (ex: `[1, 3, 2, 1, 3, 2]` for 2 epochs with 3 elements in the dataset)
501 |   - __repeat then shuffle__: we repeat the dataset for multiple epochs and then shuffle (ex: `[1, 2, 1, 3, 3, 2]` for 2 epochs with 3 elements in the dataset)
502 | 
503 | The second method provides better shuffling, but some examples might not be seen for multiple epochs in a row.
504 | The first method guarantees that you see every element of the dataset once per epoch.
505 | You can also use [`tf.contrib.data.shuffle_and_repeat()`](https://www.tensorflow.org/versions/master/api_docs/python/tf/contrib/data/shuffle_and_repeat) to perform shuffle and repeat in one step.
506 | 
507 | 
508 | ### Parallelization: using multiple threads
509 | 
510 | Parallelization of the data processing pipeline using multiple threads is almost transparent when using the `tf.data` module. We only need to add a `num_parallel_calls` argument to every `dataset.map()` call.
511 | 
512 | ```python
513 | num_threads = 4
514 | dataset = dataset.map(parse_function, num_parallel_calls=num_threads)
515 | ```
516 | 
517 | ### Prefetch data
518 | 
519 | When the GPU is working on forward / backward propagation on the current batch, we want the CPU to process the next batch of data so that it is immediately ready.
520 | As the GPU is the most expensive part of the machine, we want it to be fully used all the time during training.
521 | We call this consumer / producer overlap, where the consumer is the GPU and the producer is the CPU.
522 | 
523 | With `tf.data`, you can do this with a simple call to `dataset.prefetch(1)` at the end of the pipeline (after batching).
524 | This will always prefetch one batch of data and make sure that there is always one ready.
525 | 
526 | ```python
527 | dataset = dataset.batch(64)
528 | dataset = dataset.prefetch(1)
529 | ```
530 | 
531 | In some cases, it can be useful to prefetch more than one batch. For instance, if the duration of the preprocessing varies a lot, prefetching 10 batches averages the processing time over those 10 batches, instead of sometimes having the GPU wait for a slow batch.
532 | 
533 | To give a concrete example, suppose that 10% of the batches take 10s to compute and 90% take 1s. If the GPU takes 2s to train on one batch, prefetching multiple batches makes sure that it almost never waits for these rare longer batches: on average the preprocessing takes 0.1 × 10s + 0.9 × 1s = 1.9s per batch, which is less than the 2s of GPU compute, so the producer can keep up as long as a few batches are buffered.
534 | 
535 | 
536 | ### Order of the operations
537 | 
538 | To summarize, one good order for the different transformations is (a short sketch putting these steps together follows the list):
539 | 1. create the dataset
540 | 2. shuffle (with a big enough buffer size)
541 | 3. repeat
542 | 4. map with the actual work (preprocessing, augmentation...) using multiple parallel calls
543 | 5. batch
544 | 6. prefetch
545 | 
546 | 
547 | 
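Putting this order together, a minimal sketch of such a pipeline (the file name, `parse_fn`, `num_epochs` and the numeric values are placeholders, not the values used in our code examples) could look like this:

```python
dataset = tf.data.TextLineDataset("train.txt")          # 1. create the dataset
dataset = dataset.shuffle(buffer_size=10000)            # 2. shuffle, with a big enough buffer
dataset = dataset.repeat(num_epochs)                    # 3. repeat for multiple epochs
dataset = dataset.map(parse_fn, num_parallel_calls=4)   # 4. preprocessing / augmentation, in parallel
dataset = dataset.batch(32)                             # 5. batch
dataset = dataset.prefetch(1)                           # 6. always keep one batch ready for the GPU
```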
    548 |
    549 |
    550 |
551 | 
552 | Now that we can input data to our model, let's actually see how we define it.
553 | 
554 | 
555 | 
556 | 
557 | [github]: https://github.com/cs230-stanford/cs230-code-examples
558 | [post-1]: https://cs230-stanford.github.io/project-code-examples.html
559 | [post-2]: https://cs230-stanford.github.io/tensorflow-getting-started.html
560 | [post-4]: https://cs230-stanford.github.io/tensorflow-model.html
561 | [train-dev-test]: https://cs230-stanford.github.io/train-dev-test-split.html
562 | 
563 | 
564 | [api-tf-data]: https://www.tensorflow.org/api_docs/python/tf/data
565 | [api-tf-contrib-data]: https://www.tensorflow.org/api_docs/python/tf/contrib/data
566 | [quick-start-tf-data]: https://www.tensorflow.org/get_started/datasets_quickstart
567 | [programmer-guide-tf-data]: https://www.tensorflow.org/programmers_guide/datasets
568 | [performance-guide]: https://www.tensorflow.org/performance/performance_guide#input_pipeline_optimization
569 | [blog-post-tf-data]: https://developers.googleblog.com/2017/09/introducing-tensorflow-datasets.html
570 | [slides]: https://docs.google.com/presentation/d/16kHNtQslt-yuJ3w8GIx-eEH6t_AvFeQOchqGRFpAD7U
571 | [github-issue-tf-data]: https://github.com/tensorflow/tensorflow/issues/7951
572 | [stackoverflow]: https://stackoverflow.com/questions/tagged/tensorflow-datasets
573 | [stackoverflow-buffer-size]: https://stackoverflow.com/a/48096625/5098368
574 | --------------------------------------------------------------------------------