├── rfcs ├── 20180821-differentiable-functional-while │ ├── while_v1.png │ ├── while_v2.png │ ├── while_body.png │ └── while_cond.png ├── README.md ├── yyyymmdd-rfc-template.md ├── 20181225-tf-raw-ops.md ├── 20181026-tf-nightly.md ├── 20181025-tf-integration-testing.md ├── 20180604-dynamic-kernels.md ├── 20180507-cond-v2.md ├── 20180731-dockerfile-assembler.md ├── 20180821-differentiable-functional-while.md ├── 20180726-tf-data-windowing-reducers.md ├── 20180817-variables-20.md ├── 20181016-optimizer-unification.md ├── 20181214-move-to-addons.md └── 20181217-tf2-random-numbers.md ├── sigs ├── testing │ └── README.md ├── rust │ └── CHARTER.md ├── build │ ├── CHARTER.md │ ├── community-builds.md │ └── tensorflow-testing.md ├── tensorboard │ └── CHARTER.md ├── networking │ └── CHARTER.md ├── io │ ├── CHARTER.md │ └── RELEASE.md └── addons │ └── CHARTER.md ├── CODEOWNERS ├── README.md ├── MEETINGS.md ├── governance ├── SIG-charter-template.md ├── SIG-request-template.md ├── code-and-collaboration.md ├── tensorflow-testing.md ├── SIGS.md └── TF-RFCs.md ├── CODE_OF_CONDUCT.md ├── CONTRIBUTING.md └── LICENSE /rfcs/20180821-differentiable-functional-while/while_v1.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/0101011/community/master/rfcs/20180821-differentiable-functional-while/while_v1.png -------------------------------------------------------------------------------- /rfcs/20180821-differentiable-functional-while/while_v2.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/0101011/community/master/rfcs/20180821-differentiable-functional-while/while_v2.png -------------------------------------------------------------------------------- /rfcs/20180821-differentiable-functional-while/while_body.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/0101011/community/master/rfcs/20180821-differentiable-functional-while/while_body.png -------------------------------------------------------------------------------- /rfcs/20180821-differentiable-functional-while/while_cond.png: -------------------------------------------------------------------------------- https://raw.githubusercontent.com/0101011/community/master/rfcs/20180821-differentiable-functional-while/while_cond.png -------------------------------------------------------------------------------- /sigs/testing/README.md: -------------------------------------------------------------------------------- 1 | # TensorFlow 2.0 Testing 2 | 3 | Welcome to TF 2.0 testing! This repository will house testing plans, friction logs, and a guide for installing TensorFlow 2.0. 4 | 5 | Thanks for your interest! 6 | 7 | ## Installation Instructions 8 | 9 | ## Community Logistics 10 | 11 | ## Testing Team 12 | -------------------------------------------------------------------------------- /rfcs/README.md: -------------------------------------------------------------------------------- 1 | # TensorFlow RFCs 2 | 3 | This directory stores approved RFCs. 4 | 5 | ## Process 6 | 7 | Please read carefully the [TensorFlow RFC 8 | process](https://github.com/tensorflow/community/blob/master/governance/TF-RFCs.md). 9 | 10 | ## Template 11 | 12 | Use [this template](yyyymmdd-rfc-template.md) 13 | to draft an RFC. 
14 | -------------------------------------------------------------------------------- /CODEOWNERS: -------------------------------------------------------------------------------- 1 | # Owners for community repo 2 | # For syntax, see https://help.github.com/articles/about-codeowners/ 3 | 4 | # This file controls who is asked for a review when PRs are submitted 5 | 6 | # SIGS 7 | 8 | sigs/rust/ @adamcrume 9 | sigs/tensorboard/ @ewilderj @manivaradarajan @martinwicke 10 | sigs/build/ @martinwicke @ewilderj @angersson @perfinion 11 | sigs/addons/ @martinwicke @ewilderj @karmel @seanpmorgan @armando-fandango 12 | sigs/networking/ @martinwicke @ewilderj @byronyi @jbedorf @poxvoculi 13 | sigs/io/ @martinwicke @ewilderj @mrry @yongtang @dmitrievanthony 14 | 15 | # RFCs 16 | 17 | rfcs/ @ewilderj @martinwicke @goldiegadde 18 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | # Welcome to the TensorFlow Developer Community 2 | 3 | ## This Repository 4 | 5 | The `community` repository stores documents used by the developer community. 6 | 7 | * `rfcs` - design documents used by the design review process 8 | * `sigs` - documentation for each TensorFlow Special Interest group (SIG) 9 | * `governance` - operating processes for the TensorFlow project 10 | 11 | ## Contact 12 | 13 | For questions about this repository, please file an issue or reach out 14 | to Edd Wilder-James: ewj@google.com. 15 | 16 | ## Further Community Resources 17 | 18 | For a complete overview of the TensorFlow community resources, 19 | please visit [tensorflow.org/community](https://tensorflow.org/community). 20 | -------------------------------------------------------------------------------- /MEETINGS.md: -------------------------------------------------------------------------------- 1 | # TensorFlow Community Meetings 2 | 3 | [TensorFlow SIGs](https://github.com/tensorflow/community/tree/master/sigs) 4 | and other groups hold regular meetings via videoconference. Usually 5 | you will find these added to your calendar after you join the Google Group relevant 6 | to the SIG or community group. 7 | 8 | There is also a master calendar listing all community meetings: 9 | [TensorFlow Community Calendar](https://calendar.google.com/calendar/embed?src=tensorflow.org_14t769n89qhsps949c3l0nhd9c%40group.calendar.google.com). 10 | 11 | Google Calendar users can add the calendar to theirs using the button on the bottom of the master calendar page. If you want 12 | to add the Community Calendar to your own calendar application, use [this iCal link](https://calendar.google.com/calendar/ical/tensorflow.org_14t769n89qhsps949c3l0nhd9c%40group.calendar.google.com/public/basic.ics). 13 | 14 | -------------------------------------------------------------------------------- /sigs/rust/CHARTER.md: -------------------------------------------------------------------------------- 1 | # SIG-Rust 2 | ## Objective 3 | 4 | For users and contributors to collaborate on the TensorFlow Rust bindings 5 | project. 6 | 7 | ## Membership 8 | 9 | Everyone involved in using or developing the TensorFlow Rust bindings is welcome 10 | to join the group. To participate, join the mailing list. 11 | 12 | Archives of the mailing list will be publicly accessible. 
13 | 14 | ## Resources 15 | 16 | * [sig-rust mailing list](https://groups.google.com/a/tensorflow.org/forum/#!forum/sig-rust) 17 | * Github [tensorflow/rust](https://github.com/tensorflow/rust) 18 | 19 | 20 | ## Contacts 21 | * Project lead: Adam Crume [@adamcrume](https://github.com/adamcrume) - acrume 22 | at google 23 | * For administrative questions, contact Edd Wilder-James 24 | [@ewilderj](https://github.com/ewilderj) - ewj at google 25 | 26 | ## Code of Conduct 27 | 28 | As with all forums and spaces related to TensorFlow, SIG-Rust is subject 29 | to the [TensorFlow Code of 30 | Conduct](https://github.com/tensorflow/tensorflow/blob/master/CODE_OF_CONDUCT.md). 31 | -------------------------------------------------------------------------------- /sigs/build/CHARTER.md: -------------------------------------------------------------------------------- 1 | # SIG-Build - TensorFlow Distributors & Packagers Group 2 | 3 | ## Objective 4 | 5 | For discussion and collaboration around the building, testing, packaging, and 6 | distribution of TensorFlow. 7 | 8 | ## Membership 9 | 10 | Everyone involved in the building, testing, packaging, distributing or embedding 11 | of TensorFlow is welcome to join the group. To participate, request an invitation 12 | to join the mailing list. 13 | 14 | Archives of the mailing list will be publicly accessible. 15 | 16 | ## Resources 17 | 18 | * [sig-build mailing list](https://groups.google.com/a/tensorflow.org/forum/#!forum/build) 19 | 20 | ## Contacts 21 | 22 | * Project leads: Jason Zaman [@perfinion](https://github.com/perfinion), Austin Anderson [@angersson](https://github.com/angersson) 23 | * For administrative questions, contact Edd Wilder-James @ewilderj - ewj at 24 | google 25 | 26 | ## Code of Conduct 27 | 28 | As with all forums and spaces related to TensorFlow, SIG-Build is subject to the [TensorFlow Code of Conduct](https://github.com/tensorflow/tensorflow/blob/master/CODE_OF_CONDUCT.md). 29 | -------------------------------------------------------------------------------- /governance/SIG-charter-template.md: -------------------------------------------------------------------------------- 1 | # Proposed name: SIG-?????? 2 | 3 | ## Objective 4 | 5 | One or two sentences describing the group's purpose. 6 | 7 | ## Membership 8 | 9 | *Who can join? How can they join? Who can read the group's activity?* 10 | 11 | Example: 12 | 13 | > Everyone involved in the packaging, distributing or embedding of TensorFlow is 14 | > welcome to join the group. To participate, request an invitation to join the 15 | > mailing list. Archives of the mailing list will be publicly accessible. 16 | 17 | ## Resources 18 | 19 | *Links to essential resources: proposed mailing list, Github repo, key documents, etc.* 20 | 21 | ## Contacts 22 | 23 | *At a minimum, highlight a group leader and somebody to reach out to for 24 | administrative purposes* 25 | 26 | * *Project lead: A N Other [@githubhandle](https://github.com/githubhandle) - 27 | another at companyname* 28 | * For administrative questions, contact Edd Wilder-James 29 | [@ewilderj](https://github.com/ewilderj) - ewj at google 30 | 31 | ## Code of Conduct 32 | 33 | As with all forums and spaces related to TensorFlow, SIG-?????? is subject to 34 | the [TensorFlow Code of 35 | Conduct](https://github.com/tensorflow/tensorflow/blob/master/CODE_OF_CONDUCT.md). 
36 | -------------------------------------------------------------------------------- /sigs/tensorboard/CHARTER.md: -------------------------------------------------------------------------------- 1 | # SIG TensorBoard 2 | 3 | ## Objective 4 | 5 | For discussion and collaboration around TensorBoard, the visualization tool for TensorFlow. 6 | 7 | ## Membership 8 | 9 | Everyone interested in developing TensorBoard is welcome to join the group. To participate, request an invitation to join the mailing list. 10 | 11 | Archives of the mailing list will be publicly accessible. 12 | 13 | ## Resources 14 | 15 | * [SIG TensorBoard mailing list](https://groups.google.com/a/tensorflow.org/forum/#!forum/sig-tensorboard) 16 | * GitHub [tensorflow/tensorboard](https://github.com/tensorflow/tensorboard) 17 | 18 | * [Writing a TensorBoard plugin](https://github.com/tensorflow/tensorboard-plugin-example/blob/master/README.md) 19 | 20 | ## Contacts 21 | * Project leads: 22 | * Mani Varadarajan [@maniv](https://github.com/manivaradarajan) - maniv at google 23 | * Gal Oshri [@GalOshri](https://github.com/GalOshri) - goshri at google 24 | * For administrative questions, contact Edd Wilder-James [@ewilderj](https://github.com/ewilderj) - ewj at google 25 | 26 | ## Code of Conduct 27 | 28 | As with all forums and spaces related to TensorFlow, SIG-TensorBoard is subject 29 | to the [TensorFlow Code of 30 | Conduct](https://github.com/tensorflow/tensorflow/blob/master/CODE_OF_CONDUCT.md). 31 | -------------------------------------------------------------------------------- /governance/SIG-request-template.md: -------------------------------------------------------------------------------- 1 | # Request for SIG 2 | 3 | ## What is this group for? 4 | 5 | Describe the need the group fills. Who is the audience? 6 | Provide evidence that work is already ongoing in this area. 7 | 8 | ## Who will be part of it? 9 | 10 | Describe: 11 | 12 | * group leader 13 | * a second for the leader 14 | * one or more interested parties who will also be in the group -- provide 15 | evidence of the sustainability of the group 16 | 17 | What will your membership policy be? 18 | 19 | ## What initial problems will the group tackle? 20 | 21 | *List potential goals for the group* 22 | 23 | ## What modes of communication do you intend to use? 24 | 25 | *A mailing list is a minimum. We recommend regularly scheduled VC calls to focus 26 | on agenda items. Slack or other chat channels are optional.* 27 | 28 | ## Launch plan 29 | 30 | *Describe how the group will be launched. Example follows* 31 | 32 | ``` 33 | 1. `VC call with initial interested parties to finalize charter and initial group goals` 34 | 1. `SIG set up with initial group members` 35 | 1. `SIG added to community pages on tensorflow.org` 36 | 1. `Write blog post about SIG and its goals` 37 | 1. `Leader starts off mailing list discussion about initial work items` 38 | ``` 39 | 40 | # Charter 41 | 42 | Please draft the SIG's charter using the [SIG Charter Template](SIG-charter-template.md). 43 | 44 | 45 | -------------------------------------------------------------------------------- /sigs/networking/CHARTER.md: -------------------------------------------------------------------------------- 1 | # SIG Networking Charter 2 | 3 | ## Objective 4 | 5 | TensorFlow has built-in support for communicating intermediate results across the network using gRPC. SIG Networking aims to add support for different network fabrics and protocols. 
6 | 7 | The group evaluates proposals and designs in this area and maintains code in the `tensorflow/networking` repository. 8 | 9 | ## Membership 10 | 11 | Everybody with an interest in making TensorFlow work (better) on different types of networks or underlying drivers and libraries is welcome to join the SIG. To participate, request an invitation to join the mailing list. Archives of the mailing list are publicly accessible. 12 | 13 | ## Resources 14 | 15 | * SIG Networking mailing list: [networking@tensorflow.org](https://groups.google.com/a/tensorflow.org/forum/#!forum/networking) 16 | * Repository maintained by SIG Networking: [github.com/tensorflow/networking](https://github.com/tensorflow/networking) 17 | 18 | ## Contacts 19 | 20 | * SIG leads: Bairen Yi [@byronyi](https://github.com/byronyi) - byronyi@clustar.ai, Jeroen Bédorf [@jbedorf](https://github.com/jbedorf) - jeroen@minds.ai 21 | * TensorFlow technical contact: Paul Tucker [@poxvoculi](https://github.com/poxvoculi) - tucker@google.com 22 | * For administrative questions, contact Edd Wilder-James [@ewilderj](https://github.com/ewilderj) - ewj at google 23 | 24 | ## Code of Conduct 25 | 26 | As with all forums and spaces related to TensorFlow, SIG Networking is subject to the [TensorFlow Code of Conduct](https://github.com/tensorflow/tensorflow/blob/master/CODE_OF_CONDUCT.md). 27 | -------------------------------------------------------------------------------- /sigs/io/CHARTER.md: -------------------------------------------------------------------------------- 1 | # SIG IO - TensorFlow data formats and file systems group 2 | 3 | ## Objective 4 | 5 | TensorFlow has built-in support for accessing a small set of file systems and 6 | data formats. This SIG aims to add support for more file systems and file 7 | formats. 8 | 9 | The group evaluates proposals and designs in this area and maintains code in the 10 | `tensorflow/io` repository. The repository should contain only subclasses of 11 | `tf.data.Dataset` and TensorFlow filesystems, as well as supporting code. 12 | 13 | ## Membership 14 | 15 | Everybody with an interest in improving TensorFlow interoperability is welcome 16 | to join the SIG. To participate, request an invitation to join the mailing list. 17 | Archives of the mailing list are publicly accessible. 18 | 19 | ## Resources 20 | 21 | * SIG IO mailing list: [io@tensorflow.org](https://groups.google.com/a/tensorflow.org/forum/#!forum/io) 22 | * Gitter room: [tensorflow/sig-io](https://gitter.im/tensorflow/sig-io) 23 | * Github repository: [github.com/tensorflow/io](https://github.com/tensorflow/io) 24 | * Python package repository: [tensorflow-io](https://pypi.org/project/tensorflow-io) 25 | * R package repository: [tfio](https://cran.r-project.org/package=tfio) 26 | 27 | ## Releases 28 | 29 | Information about SIG IO releases and the release team can be found in [RELEASE.md](RELEASE.md). 
30 | 31 | ## Contacts 32 | 33 | * Project leads: 34 | - Yong Tang [@yongtang](https://github.com/yongtang) - yong.tang.github@outlook.com 35 | - Anthony Dmitriev [@dmitrievanthony](https://github.com/dmitrievanthony) - dmitrievanthony@gmail.com 36 | * TensorFlow technical contact [@mrry](https://github.com/mrry) - mrry@google.com 37 | * For administrative questions, contact Edd Wilder-James 38 | [@ewilderj](https://github.com/ewilderj) - ewj at google 39 | 40 | ## Code of Conduct 41 | 42 | As with all forums and spaces related to TensorFlow, SIG-I/O is subject to the [TensorFlow Code of Conduct](https://github.com/tensorflow/tensorflow/blob/master/CODE_OF_CONDUCT.md). 43 | -------------------------------------------------------------------------------- /rfcs/yyyymmdd-rfc-template.md: -------------------------------------------------------------------------------- 1 | # Title of RFC 2 | 3 | | Status | (Proposed / Accepted / Implemented / Obsolete) | 4 | | :-------------- | :---------------------------------------------------- | 5 | | **Author(s)** | My Name (me@example.org), AN Other (you@example.org) | 6 | | **Sponsor** | A N Expert (whomever@tensorflow.org) | 7 | | **Updated** | YYYY-MM-DD | 8 | | **Obsoletes** | TF-RFC it replaces, else remove this header | 9 | 10 | ## Objective 11 | 12 | What are we doing and why? What problem will this solve? What are the goals and 13 | non-goals? This is your executive summary; keep it short, elaborate below. 14 | 15 | ## Motivation 16 | 17 | Why is this a valuable problem to solve? What background information is needed 18 | to show how this design addresses the problem? 19 | 20 | Which users are affected by the problem? Why is it a problem? What data supports 21 | this? What related work exists? 22 | 23 | ## Design Proposal 24 | 25 | This is the meat of the document, where you explain your proposal. If you have 26 | multiple alternatives, be sure to use sub-sections for better separation of the 27 | idea, and list pros/cons of each approach. If there are alternatives that you 28 | have eliminated, you should also list those here, and explain why you believe 29 | your chosen approach is superior. 30 | 31 | Factors to consider include: 32 | 33 | * performance implications 34 | * dependencies 35 | * maintenance 36 | * platforms and environments impacted (e.g. hardware, cloud, other software 37 | ecosystems) 38 | * [compatibility](https://www.tensorflow.org/programmers_guide/version_compat) 39 | * how will this change impact users, and how will that be managed? 40 | 41 | ## Detailed Design 42 | 43 | This section is optional. Elaborate on details if they’re important to 44 | understanding the design, but would make it hard to read the proposal section 45 | above. 46 | 47 | ## Questions and Discussion Topics 48 | 49 | Seed this with open questions you require feedback on from the RFC process. 50 | -------------------------------------------------------------------------------- /rfcs/20181225-tf-raw-ops.md: -------------------------------------------------------------------------------- 1 | # `tf.raw_ops` 2 | 3 | | Status | Accepted | 4 | | :------------ | :------------------------------------------------------ | 5 | | **Author** | apassos@google.com | 6 | | **Sponsor** | wicke@google.com | 7 | | **Updated** | 2018-12-21 | 8 | 9 | ## Objective 10 | 11 | Expose a `tf.raw_ops` namespace containing all raw operations in TensorFlow. 
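To make the calling convention concrete, here is a rough sketch of the intended usage (illustrative only: `Add` stands in for any generated op, and the exact generated signature is an assumption based on the design below):

```python
import tensorflow as tf

x = tf.constant([1.0, 2.0])
y = tf.constant([3.0, 4.0])

# All arguments to a raw op binding are keyword-only, including the
# always-last `name` argument added by the op binding generator.
z = tf.raw_ops.Add(x=x, y=y, name="raw_add")

# Positional calls are rejected, so new attrs can be added to an op
# without breaking existing callers:
# tf.raw_ops.Add(x, y)  # would raise TypeError
```

Because every argument must be named, the bindings stay stable even if an op later grows new attributes.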
12 | 13 | ## Motivation 14 | 15 | Some parts of the TensorFlow Python API, such as variables, optimizers, and 16 | control flow, are currently not implementable by third parties. Moreover, with 17 | the tf.contrib deprecation, there is now no valid Python endpoint from which to 18 | use many TF operations. 19 | 20 | ## Design Proposal 21 | 22 | We'll add a `tf.raw_ops` namespace to TensorFlow, with Python bindings for all 23 | non-deprecated TensorFlow ops, which can be used in a backwards-compatible 24 | way. This is designed to be consumed by downstream library writers and not end 25 | users. 26 | 27 | ## Detailed Design 28 | 29 | The namespace will be automatically populated with generated bindings for every 30 | operation in TensorFlow. These generated bindings will be similar to the ones 31 | currently used for the python API, with the following differences: 32 | 33 | * All arguments are keyword arguments. 34 | - This allows us to add new attributes to existing ops without breaking users 35 | who call by positional arguments (given that there is an always-last `name` 36 | argument added by the tf op binding generator). 37 | - This also prevents users from assuming that calling conventions from the 38 | existing python bindings apply to the raw versions (we often do argument 39 | reordering in our python bindings, for example). 40 | * Any op marked as deprecated will be in the namespace but will raise an 41 | exception when used. 42 | - This includes ops which take or produce ref tensors. 43 | - This allows us to deprecate ops eventually and to be less strict with the API 44 | here than with the main API. 45 | - This is mostly OK since only library writers are supposed to use these 46 | symbols, and the deprecation messages should include upgrading instructions. 47 | 48 | 49 | ## Questions and Discussion Topics 50 | 51 | * Naming: tf.raw_ops is the name 52 | * Backward compatibility policy: we'll document on tf.org 53 | * Flat namespace vs nested? flat 54 | * Will not include protocol buffers 55 | -------------------------------------------------------------------------------- /sigs/addons/CHARTER.md: -------------------------------------------------------------------------------- 1 | # SIG Addons 2 | 3 | ## Objective 4 | 5 | TensorFlow natively supports a large number of operators, layers, metrics, losses, and optimizers. However, in a fast-moving field like ML, there are many interesting new developments that cannot be integrated into core TensorFlow (because they are experimental, or their significance is not yet clear). 6 | 7 | This special interest group maintains a repository of bleeding edge contributions that conform to well-established API patterns, but implement new functionality not available in core TensorFlow. 8 | 9 | ## Scope 10 | 11 | This group maintains the [tensorflow/addons](https://github.com/tensorflow/addons) repository. It contains additional functionality which fits the following criteria: 12 | 13 | * The functionality is not otherwise available in TensorFlow 14 | * The functionality conforms to an established API pattern in TensorFlow. For instance, it could be an additional subclass of an existing interface (new Layer, Metric, or Optimizer subclasses), or an additional Op or OpKernel implementation. 15 | * Addons have to be compatible with TensorFlow 2.x. 16 | * The addon conforms to the code and documentation standards defined by the group. 
These policies are detailed in the project's [README](https://github.com/tensorflow/addons/blob/master/README.md) 17 | * The addon is useful for a large number of users (e.g., an implementation used in a widely cited paper, or a utility with broad applicability) 18 | 19 | The group is responsible for reviewing new additions to the repository, including evaluating designs and implementations. 20 | 21 | ## Membership 22 | 23 | Everybody with an interest in helping extend TensorFlow with new types of Ops, Layers, etc. is welcome to join the SIG. To participate, request an invitation to join the mailing list. Maintainer status for the repository will be conferred by consensus of the existing members. Archives of the mailing list are publicly accessible. 24 | 25 | ## Resources 26 | 27 | * SIG Addons mailing list: [addons@tensorflow.org](https://groups.google.com/a/tensorflow.org/forum/#!forum/addons) 28 | * Repository maintained by SIG Addons: [github.com/tensorflow/addons](https://github.com/tensorflow/addons) 29 | 30 | ## Contacts 31 | 32 | * Project leads: Sean Morgan [@seanpmorgan](https://github.com/seanpmorgan) - seanmorgan@outlook.com, 33 | Armando Fandango [@armando-fandango](https://github.com/armando-fandango) - armando@neurasights.com 34 | * TensorFlow technical contact [@karmel](https://github.com/karmel) - karmel@google.com 35 | * For administrative questions, contact Edd Wilder-James [@ewilderj](https://github.com/ewilderj) - ewj at google 36 | 37 | ## Code of Conduct 38 | 39 | As with all forums and spaces related to TensorFlow, SIG Addons is subject to the [TensorFlow Code of Conduct](https://github.com/tensorflow/tensorflow/blob/master/CODE_OF_CONDUCT.md). 40 | -------------------------------------------------------------------------------- /rfcs/20181026-tf-nightly.md: -------------------------------------------------------------------------------- 1 | # `tf-nightly` and `tf-nightly-gpu` renovations 2 | 3 | | Status | Implemented | 4 | | :------------ | :------------------------------------------------------ | 5 | | **Author** | amitpatankar@google.com | 6 | | **Sponsor** | gunan@google.com | 7 | | **Updated** | 2018-10-26 | 8 | 9 | ## Objective 10 | 11 | Plan a new process and protocol for how we build and distribute nightlies. `tf-nightly` is now widely used to test and evaluate performance as official releases are spaced further apart. If we are following `tf-estimator`'s process for individual module testing, we need to make `tf-nightly` more reliable and robust. 12 | 13 | ## Motivation 14 | 15 | Previously, we built from HEAD of the master branch every night for all operating systems. We took the builds for Ubuntu and used them to create our Docker containers so that the git hashes matched. Breakages were quite common, and most of our nightly builds fell far behind on all platforms. For example, there was a three-month stretch where Windows was not updated. 16 | 17 | ## Design Proposal 18 | 19 | We will take the latest postsubmit build that has passed for each platform and get the commit hash. Based on that hash we will create nightly binaries. If the last green postsubmit is more than 24 hours old, we will not publish binaries for that platform for that day. 20 | 21 | Absolutely no tests will be run on the binaries. If it builds, it ships. Refer to the diagram below for different cases. 
22 | 23 | ![](https://storage.googleapis.com/amitpatankar/tf-nightly-postsubmit.png) 24 | 25 | 26 | ## Detailed Design 27 | 28 | ### Support 29 | * We will still continue to offer packages for: 30 | 31 | |Platform/OS: |CPU |GPU |Package Type | 32 | |---------------|------|------|-------------------------| 33 | |Mac |Yes |No |Pip `Python 2.7-3.6` | 34 | |Ubuntu |Yes |Yes |Pip `Python 2.7-3.6` | 35 | |Windows |Yes |Yes |Pip `Python 2.7-3.6` | 36 | |Docker-dev |Yes |Yes |Container `Python 2&3` | 37 | |Docker-nondev |Yes |Yes |Container `Python 2&3` | 38 | * Please file bugs on [GitHub](https://github.com/tensorflow/tensorflow/issues) if a nightly build for a certain platform has not been pushed for a week. We will do our best to push builds every night, but please wait 7 days before notifying us. 39 | * We will also be much less active on Windows builds, especially GPU. We often find that those are difficult to fix, and most `tf-nightly` users are on Ubuntu and Docker. The grace period before you can notify us about Windows GPU builds will be two weeks. 40 | 41 | 42 | ### Versioning 43 | ![](https://storage.googleapis.com/amitpatankar/tf-rename-release-diagram.png) 44 | 45 | ## Questions and Discussion Topics 46 | 47 | * Although the package names for `tf-nightly` and `tensorflow` differ, installing one after the other will overwrite some files in site-packages. 48 | * Hashes may be mismatched. The binary for a certain day on Windows can be built from a different hash than the corresponding binary on Ubuntu. 49 | * We cannot name them anything better due to [PEP440](https://www.python.org/dev/peps/pep-0440/) compliance. -------------------------------------------------------------------------------- /sigs/io/RELEASE.md: -------------------------------------------------------------------------------- 1 | # SIG IO Releases 2 | 3 | At the moment SIG IO releases consist of three parts: 4 | - Release of source code with versioning in GitHub 5 | - Release of Python package in PyPI 6 | - Release of R package to CRAN 7 | 8 | ## GitHub Source Code Release 9 | 10 | To perform a release in GitHub, the following steps are needed: 11 | - Create a PR to update the RELEASE.md in 12 | [github.com/tensorflow/io](https://github.com/tensorflow/io) 13 | * Add updates for new features, enhancements, bug fixes 14 | * Add contributors using `git shortlog <last-version>..HEAD -s` 15 | - Merge the PR for RELEASE.md update 16 | - Create a new version through GitHub 17 | 18 | ## PyPI Python Package Release 19 | 20 | To perform a release in PyPI, first complete the above GitHub release, then 21 | build pip packages locally with Docker using the following command 22 | ``` 23 | $ docker run -it -e BAZEL_VERSION=0.20.0 --rm -v ${PWD}:/working_dir \ 24 | -w /working_dir tensorflow/tensorflow:custom-op \ 25 | bash -x /working_dir/.travis/python.release.sh <2.7|3.4|3.5|3.6> 26 | ``` 27 | Note that the above command has to be run four times, once each with 2.7, 3.4, 3.5, and 3.6, 28 | to generate pip packages for all of the different Python versions. 
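For convenience, the four invocations can be scripted in a single pass (a sketch reusing the exact command above; it assumes the same working directory and Docker image):

```bash
#!/bin/bash
# Build the pip package for each supported Python version in turn.
for version in 2.7 3.4 3.5 3.6; do
  docker run -it -e BAZEL_VERSION=0.20.0 --rm -v ${PWD}:/working_dir \
    -w /working_dir tensorflow/tensorflow:custom-op \
    bash -x /working_dir/.travis/python.release.sh "$version"
done
```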
29 | 30 | Then upload `artifacts/*.whl` files with: 31 | ``` 32 | twine upload artifacts/* 33 | ``` 34 | 35 | ## CRAN R Package Release 36 | 37 | Before submitting the R package to CRAN, manually perform and check the following items: 38 | * Make sure the documentation in `README.md` and `vignettes` is up-to-date 39 | * Update `Version` field in `DESCRIPTION` file 40 | * Update `NEWS.md` to include items for this new release 41 | * Run `devtools::check()` and fix all the notable issues, especially warnings and errors 42 | * Update `cran-comments.md` to include any unsolvable issues from `devtools::check()` and 43 | other comments/responses to CRAN maintainers 44 | * Run checks on R-hub via `devtools::check_rhub()` and on win-builder via `devtools::check_win_devel()`. This is 45 | optional since Python is not installed on CRAN test machines and we skip the tests on 46 | CRAN. 47 | 48 | To submit the package to CRAN for review, do the following: 49 | * Run `devtools::release()` to submit for review. Here's what it looks like if the submission is successful: 50 | ``` 51 | Submitting file: /var/folders/zp/k98_wphd0h9c5b3zyk5xhnhm0000gn/T//RtmpHh9Wdo/tfio_0.1.0.tar.gz 52 | File size: 483.4 Kb 53 | Uploading package & comments 54 | Confirming submission 55 | Package submission successful. 56 | Check your email for confirmation link. 57 | ``` 58 | * Check your email for the confirmation link and confirm the submission 59 | * CRAN maintainers will review the submission and email you the result of the submission. 60 | If there are any additional issues or comments that need to be addressed, address them and re-submit. 61 | 62 | ## SIG IO Release Team 63 | 64 | Everybody with an interest in helping with SIG IO releases is welcome 65 | to join the Release Team. To participate, create a PR to update 66 | the doc or send an email to the SIG IO mailing list 67 | [io@tensorflow.org](https://groups.google.com/a/tensorflow.org/forum/#!forum/io). 68 | Please provide both your GitHub and PyPI handles to join the release team. 69 | 70 | Current Release Team: 71 | - Yong Tang - GitHub: [@yongtang](https://github.com/yongtang) - PyPI: [yongtang](https://pypi.org/user/yongtang) 72 | - Anthony Dmitriev - GitHub: [@dmitrievanthony](https://github.com/dmitrievanthony) - PyPI: [dmitrievanthony](https://pypi.org/user/dmitrievanthony) 73 | - Yuan (Terry) Tang - GitHub: [@terrytangyuan](https://github.com/terrytangyuan) - PyPI: [terrytangyuan](https://pypi.org/user/terrytangyuan) 74 | - Bryan Cutler - GitHub: [@BryanCutler](https://github.com/BryanCutler) - PyPI: [cutlerb](https://pypi.org/user/cutlerb) 75 | -------------------------------------------------------------------------------- /rfcs/20181025-tf-integration-testing.md: -------------------------------------------------------------------------------- 1 | # TensorFlow Integration Testing 2 | 3 | | Status | Accepted | 4 | | :------------ | :------------------------------------------------------ | 5 | | **Author** | amitpatankar@google.com | 6 | | **Sponsor** | gunan@google.com | 7 | | **Updated** | 2018-10-24 | 8 | 9 | ## Objective 10 | 11 | This document proposes the official way to test projects and repositories downstream from TensorFlow. With TensorFlow becoming more and more modularized, libraries that sit on top of core TensorFlow need to be tested. 
Unfortunately we cannot wait for any adjustments made to core TensorFlow to propagate through to a formal release, and we need a reliable way of getting the latest stable TensorFlow to test any new changes to the external repositories. A great example is the estimator library, which is moving out of TensorFlow but is still heavily dependent on core TensorFlow changes. 12 | 13 | ## Motivation 14 | 15 | There are three potential approaches to testing TensorFlow-dependent libraries: 16 | 17 | * Test with the latest official release. 18 | * Test by building TensorFlow from source at HEAD on the master branch. 19 | * Test with the old `tf-nightly`. 20 | 21 | |Approach: |TF-Release|TF-Head |Old `tf-nightly`| 22 | |------------------------------|----------|---------|----------------| 23 | |TensorFlow update latency |Poor |Excellent|Average | 24 | |Test setup overhead |Excellent |Poor |Excellent | 25 | |Stability |Excellent |Poor |Poor | 26 | |Test dependencies immediately |Poor |Excellent|Poor | 27 | 28 | None of these solutions are ideal for testing projects downstream from TensorFlow. 29 | 30 | ## Design Proposal 31 | 32 | ### New Testing Approach 33 | 34 | The [renovated `tf-nightly` approach](https://github.com/tensorflow/community/blob/master/rfcs/20181026-tf-nightly.md) will combat the two issues that plague option 3 for testing TensorFlow-dependent packages. 35 | 36 | |Approach: |New `tf-nightly` | 37 | |-----------------------------|------------------| 38 | |TensorFlow update latency |Excellent | 39 | |Test setup overhead |Excellent | 40 | |Stability |Excellent | 41 | |Test dependencies immediately|Excellent | 42 | 43 | #### Stability 44 | Sometimes the `tf-nightly` packages were created but failed immediately when attempting `import tensorflow`. 45 | 46 | #### Test dependencies immediately 47 | Sometimes `tf-nightly` packages are behind because there are infrastructure issues or the hash they build off of at midnight does not build. With the guaranteed latest green postsubmit, your test is guaranteed to be run against the latest stable TensorFlow code, possibly from the previous day. 48 | 49 | 50 | ### Example testing strategy 51 | Here is a quick example that shows how TensorFlow can work with TensorBoard. This example uses a virtualenv with Python 3 to run a simple test that theoretically depends on the latest code from TensorFlow. 52 | 53 | ##### Create the virtual environment 54 | 55 | ```bash 56 | $ virtualenv -p python3 tf 57 | $ source tf/bin/activate 58 | (tf)$ pip install --upgrade pip 59 | ``` 60 | 61 | ##### Install and check `tf-nightly` or `tf-nightly-gpu` 62 | 63 | ```bash 64 | (tf)$ pip install --upgrade tf-nightly 65 | Successfully installed tf-nightly-1.13.0.dev20181023 66 | (tf)$ python -c 'import tensorflow as tf; print(tf.__version__)' 67 | 1.13.0-dev20181023 68 | ``` 69 | 70 | ##### Clone and test the dependent project 71 | 72 | ```bash 73 | (tf)$ git clone git@github.com:tensorflow/tensorboard.git 74 | Cloning into 'tensorboard'... 75 | remote: Counting objects: 20684, done. 76 | remote: Total 20684 (delta 0), reused 0 (delta 0), pack-reused 20683 77 | Receiving objects: 100% (20684/20684), 12.17 MiB | 8.89 MiB/s, done. 78 | Resolving deltas: 100% (15053/15053), done. 
79 | (tf)$ cd tensorboard 80 | (tf)$ bazel run //tensorboard/plugins/scalar:scalars_demo 81 | ``` 82 | 83 | 84 | -------------------------------------------------------------------------------- /sigs/build/community-builds.md: -------------------------------------------------------------------------------- 1 | # Community Supported TensorFlow Builds and Releases 2 | 3 | 4 | ## Overview 5 | 6 | TensorFlow is used in many more environments and configurations than is practical for the core team to regularly test and support: so we need a way to include federated third party testing and builds. 7 | 8 | This document describes a process for creating third party builds of TensorFlow, federating tests and builds, and making the build artifacts available to users. Examples of such builds include those optimized for particular hardware configurations, operating system environments, or other specific applications. 9 | 10 | There are three major phases of the process: 11 | 12 | 13 | 14 | 1. Engagement — connect with the TensorFlow core team and work on a plan for integration, tests, documentation and support 15 | 1. Testing — set up continuous integration and connect to GitHub webhooks 16 | 1. Building — once tests exist and pass, and builds are available, they will be linked as community supported builds from the official TensorFlow site 17 | 18 | 19 | ## Phase 1: Engagement 20 | 21 | You should first join the [SIG Build interest group](https://groups.google.com/a/tensorflow.org/forum/#!forum/build): this community is the main way coordination happens around building, testing and releasing TensorFlow. 22 | 23 | To start the process, reach out with a description of your intent to build a particular flavor or release of TensorFlow to the SIG Build community: include a tracking bug filed in GitHub. 24 | 25 | A TensorFlow team member will reply and start the planning process with you. Together, we will create a plan to get the work to "community supported" status. We discuss how to integrate your code, what the TensorFlow team needs from you, and set expectations for both sides. 26 | 27 | In particular we will need to ensure there is: 28 | 29 | 30 | 31 | * A testing plan, to make sure the build is periodically tested by you, with our help. The TensorFlow team won't run these tests. We also will not add tests that will block merging code to the central TensorFlow repository. 32 | * Documentation and examples. You should plan to provide sufficient documentation to let people install, setup, and use the artifacts you have created. 33 | * A support plan. Before we link the build artifacts from the web site, you will need to provide a contact for support and maintenance of the packages. 34 | 35 | The TensorFlow team will periodically review community supported efforts, and highlight them in collaboration with you through various promotional channels on a case-by-case basis: for example, through blog posts or conference presentations. 36 | 37 | 38 | ## Phase 2: Testing 39 | 40 | In this phase, we agree what configurations should be tested based on what the community needs and what you are willing to contribute. Usually, this should be a discussion conducted within SIG Build. 41 | 42 | The TensorFlow team will work with you to set up continuous testing of your build: 43 | 44 | 45 | 46 | * There is no mandated CI system: you can choose what CI system you would like to use (e.g. 
Jenkins, Travis, custom) 47 | * We recommend running as many unit tests as possible 48 | * Continuous testing of the master branch is required 49 | * Testing release branches at least once after each branch cut is highly recommended 50 | 51 | The TensorFlow team will create "webhooks" in our GitHub repository to enable automated triggering of tests in your CI. 52 | 53 | Once the tests are up and running, we will link to the CI build status under community supported builds on GitHub, as is the case for the IBM CI links [here](https://github.com/tensorflow/tensorflow/blob/master/README.md)! 54 | 55 | 56 | ## Phase 3: Building 57 | 58 | At this stage, we must be sure that the continuous integration is configured, all tests pass, and that the CI setup proves stable. 59 | 60 | You will set up a destination download and documentation site, and the TensorFlow web site will add a link to it, highlighting that this is a community supported build, with credit to you and your organization. 61 | 62 | To be listed as a build, you must also provide: 63 | 64 | 65 | 66 | * One or more GitHub users to assign issues to 67 | * Support details for users to report bugs to you 68 | * Documentation as discussed in Phase 1 69 | 70 | If the build remains broken for an extended period of time, the TensorFlow team may remove it from the community builds list until the requirements for phase 3 are once again met. 71 | 72 | 73 | ## Comments and questions 74 | 75 | Please feel free to ask further about this process on the [build@tensorflow.org](mailto:build@tensorflow.org) mailing list. 76 | 77 | -------------------------------------------------------------------------------- /governance/code-and-collaboration.md: -------------------------------------------------------------------------------- 1 | 2 | # TensorFlow Governance: Code and Collaboration 3 | 4 | ## Projects 5 | 6 | A **project** is the primary unit of collaboration. It can either have its own 7 | repo, or be a part of another repo (e.g. a directory in _tensorflow/models_). 8 | 9 | 10 | ## Contributors 11 | 12 | Anyone can submit a PR contribution to any project, as long as they have signed 13 | the CLA and follow the guidelines in 14 | [CONTRIBUTING.md](https://github.com/tensorflow/tensorflow/blob/master/CONTRIBUTING.md). 15 | Their code must be reviewed by a maintainer, and must pass all applicable tests. 16 | 17 | Code reviews check for code quality and style, including documentation, and 18 | enforce API compatibility guarantees and other policies. Contributions may be 19 | rejected for strategic reasons unrelated to the code in question, for instance 20 | because a feature may be too costly to maintain, or because it would duplicate 21 | APIs. 22 | 23 | ## Maintainers 24 | 25 | A project has one or more **maintainers**. 26 | 27 | Maintainers have write access to the repo containing their project. That means 28 | they can review PRs—an approving review will allow PRs to be merged. They can 29 | also change labels (which means they can trigger tests), add assignees and 30 | reviewers, and they can be assigned to issues and PRs. 31 | 32 | Note that for some repos, being a maintainer may not allow direct commit access, 33 | which is reserved for administrators or bots. *tensorflow/tensorflow* is a case 34 | in point: due to the complexity around releasing, only a small group of release 35 | engineers are administrators. In this case maintainers have approval rights. 
36 | 37 | If a repo is shared between many projects, we use GitHub's CODEOWNERS to 38 | identify owners and route PRs to them for review. Because of the way CODEOWNERS 39 | works, it is not possible to use the GitHub routing mechanism everywhere -- for 40 | example, the TensorFlow CODEOWNERS file is mainly for informational purposes, as 41 | to not impede merges. 42 | 43 | Once there are more than a couple of maintainers for a project, we will create a 44 | GitHub team for the project maintainers. This allows for easier maintenance, and 45 | opens up some [GitHub 46 | tooling](https://help.github.com/articles/about-team-discussions/) for 47 | communication. Larger projects can facilitate coordination and contribution 48 | through establishing a 49 | [TensorFlow SIG](SIGS.md). 50 | 51 | 52 | ### Repositories requiring synchronization 53 | 54 | For some projects initiated by Google (including the _tensorflow/tensorflow_ 55 | repo), the infrastructure which synchronizes and merges internal and external 56 | changes requires that all merges are performed by a Google employee. In such 57 | cases, Google sets up an on-call rotation which merges PRs once they pass tests 58 | (and a specific label is applied to the PR in order to notify the rotation to 59 | merge it). This does not preclude non-Google contributors from becoming 60 | maintainers. In this case, the maintainers of the project decide on what should 61 | be merged, then the actual merging is performed as a service. In some cases, 62 | Google-internal tests may fail and may have to be fixed: the Google employee 63 | will work with the submitter to achieve this. 64 | 65 | 66 | ### Achieving maintainer status 67 | 68 | Maintainers may elevate a contributor to maintainer status, on evidence of 69 | previous contributions and established trust. 70 | 71 | ## Collaboration 72 | 73 | Maintainers are free to agree on their preferred form of collaboration and 74 | decision making, with the requirement that regular communication about decisions 75 | must be made publicly accessible—this can happen after the fact, for example in 76 | the form of publishing meeting minutes, reviews, or announcements. Communication 77 | about topics such as admitting other maintainers, or as of yet undisclosed 78 | security issues, can be kept confidential. 79 | 80 | If significant engagement from multiple parties is encountered, the group may 81 | request the formation of a SIG to formalize collaboration and cooperation. The 82 | threshold for SIG formation includes: 83 | 84 | * A clearly stated purpose 85 | * Two or more non-maintainers willing to contribute code, and evidence of 86 | existing demand for the group 87 | * Project maintainers willing to be in the group and shepherd contributions 88 | 89 | For further details on SIGs, read [TensorFlow SIGs](SIGS.md). 90 | 91 | As with most structures, a project doesn't need a SIG to get started, but should 92 | find a home in one if it has proven itself as an ongoing concern, as SIGs are 93 | the primary organizational vehicle for the contributor community. 
94 | -------------------------------------------------------------------------------- /governance/tensorflow-testing.md: -------------------------------------------------------------------------------- 1 | # Testing TensorFlow and Reporting Issues 2 | 3 | ## 📢 How to Report Issues 4 | 5 | Over the last few years, and with the extremely productive involvement of our community (_thank you!_), the TensorFlow development team has [reviewed RFCs](https://github.com/tensorflow/community/pulls?utf8=%E2%9C%93&q=is%3Apr+label%3A2.0+), added [many new features](https://www.tensorflow.org/resources/), and implemented most of what will be [TensorFlow 2.0](https://www.tensorflow.org/community/roadmap#tensorflow_20_is_coming) - a significant milestone for the framework, with a focus on ease of use. 6 | 7 | TensorFlow is truly a community effort, and **we would love to have your feedback on how we've been doing so far**, as well as your suggestions for ways that we can improve! 8 | 9 | --------------------------------- 10 | 11 | ## 📝 What is a Good Issue? 12 | 13 | ### 🐞 Report a Bug 14 | 15 | Please submit all bugs, errors, and peculiarities on GitHub. Differences between documentation and implementation, lack of 16 | documentation, performance issues, or compatibility problems are all fair game. Please be specific and include all information 17 | that would be helpful to debug the issue using our issue templates: 18 | 19 | * **[Bug / Performance Issue](https://github.com/tensorflow/tensorflow/issues/new?template=00-bug-performance-issue.md)** 20 | * **[Build / Installation Issue](https://github.com/tensorflow/tensorflow/issues/new?template=10-build-installation-issue.md)** 21 | * **[Documentation Issue](https://github.com/tensorflow/tensorflow/issues/new?template=20-documentation-issue.md)** 22 | * **[Other Issue - Not Listed](https://github.com/tensorflow/tensorflow/issues/new?template=50-other-issues.md)** 23 | 24 | If you have a general question, you can [submit it to StackOverflow](https://stackoverflow.com/questions/tagged/tensorflow) with the tag `tensorflow`, or to our [discuss@](https://groups.google.com/a/tensorflow.org/forum/#!forum/discuss) mailing group. Our engineering team tries to answer as many of these questions as possible, but we appreciate help from end users! 25 | 26 | ### ✨ Submit a Feature Request 27 | 28 | As members of the TensorFlow community, your recommendations and suggestions are highly valued, and we are honored to have them. Please submit all feature requests as an issue on GitHub: 29 | 30 | * **[Feature Request](https://github.com/tensorflow/tensorflow/issues/new?template=30-feature-request.md)** 31 | * **[TensorFlow Lite Op Request](https://github.com/tensorflow/tensorflow/issues/new?template=40-tflite-op-request.md)** 32 | 33 | 34 | ### 🤔 Send an Experience Report 35 | 36 | If you would like to submit general feedback about TensorFlow (and in particular, about TensorFlow 2.0), consider submitting a friction log! 37 | 38 | **Friction logs** are documents that describe the frustrations and delights of a product, focused around a specific use case (for example, creating an LSTM model for text classification). They're also intended to be brutally honest - feel free to vent or to praise! 😊 39 | 40 | A template and example of a TensorFlow friction log can be found [here](https://docs.google.com/document/d/1_-0Zzn0hqS4ltLwqWAHm41-MgE60_9zlKyPHr5c-HCs/edit?usp=sharing). 
41 | 42 | Once you have completed such a document, please email it to our [testing team](mailto:testing@tensorflow.org). 43 | 44 | --------------------------------- 45 | 46 | ## 🛠 How to Get Involved 47 | 48 | Between now and the preview launch for TensorFlow 2.0, we will be actively maintaining a discussion group for any questions, comments, suggestions, or issues that arise. **We will be holding a weekly stand-up for TF 2.0 testing via Hangouts** that will be announced through the TensorFlow Testing Discussion Group. 49 | 50 | _Please subscribe to [testing@tensorflow.org](http://groups.google.com/a/tensorflow.org/forum/#!forum/testing) to stay up-to-date._ 51 | 52 | ### Special Interest Groups (SIGs) 53 | 54 | TensorFlow's [Special Interest Groups (SIGs)](https://github.com/tensorflow/community/tree/master/sigs) support community collaboration on particular projects. Members of these groups work together to build and support specific parts of TensorFlow or TensorFlow-related projects. 55 | 56 | _To join the discussion on a specific topic, subscribe to one of our SIG mailing lists:_ 57 | 58 | * **[TensorBoard](https://groups.google.com/a/tensorflow.org/d/forum/sig-tensorboard)**: Plug-in development, discussion, and contribution to TensorFlow visualization tooling. 59 | * **[Networking](https://groups.google.com/a/tensorflow.org/d/forum/networking)**: Adding network protocols other than gRPC. 60 | * **[I/O](https://groups.google.com/a/tensorflow.org/d/forum/io)**: Support for file systems and formats not available in core TensorFlow. 61 | * **[Add-ons](https://groups.google.com/a/tensorflow.org/d/forum/addons)**: Extensions to TensorFlow that conform to the stable API. 62 | * **[Build](https://groups.google.com/a/tensorflow.org/d/forum/build)**: Discussion on TensorFlow distribution and packaging. 63 | -------------------------------------------------------------------------------- /sigs/build/tensorflow-testing.md: -------------------------------------------------------------------------------- 1 | # Testing TensorFlow and Reporting Issues 2 | 3 | ## 📢 How to Report Issues 4 | 5 | Over the last few years, and with the extremely productive involvement of our community (_thank you!_), the TensorFlow development team has [reviewed RFCs](https://github.com/tensorflow/community/pulls?utf8=%E2%9C%93&q=is%3Apr+label%3A2.0+), added [many new features](https://www.tensorflow.org/resources/), and implemented most of what will be [TensorFlow 2.0](https://www.tensorflow.org/community/roadmap#tensorflow_20_is_coming) - a significant milestone for the framework, with a focus on ease of use. 6 | 7 | TensorFlow is truly a community effort, and **we would love to have your feedback on how we've been doing so far**, as well as your suggestions for ways that we can improve! 8 | 9 | --------------------------------- 10 | 11 | ## 📝 What is a Good Issue? 12 | 13 | ### 🐞 Report a Bug 14 | 15 | Please submit all bugs, errors, and peculiarities on GitHub. Differences between documentation and implementation, lack of 16 | documentation, performance issues, or compatibility problems are all fair game. 
Please be specific and include all information 17 | that would be helpful to debug the issue using our issue templates: 18 | 19 | * **[Bug / Performance Issue](https://github.com/tensorflow/tensorflow/issues/new?template=00-bug-performance-issue.md)** 20 | * **[Build / Installation Issue](https://github.com/tensorflow/tensorflow/issues/new?template=10-build-installation-issue.md)** 21 | * **[Documentation Issue](https://github.com/tensorflow/tensorflow/issues/new?template=20-documentation-issue.md)** 22 | * **[Other Issue - Not Listed](https://github.com/tensorflow/tensorflow/issues/new?template=50-other-issues.md)** 23 | 24 | If you have a general question, you can [submit it to StackOverflow](https://stackoverflow.com/questions/tagged/tensorflow) with the tag `tensorflow`, or to our [discuss@](https://groups.google.com/a/tensorflow.org/forum/#!forum/discuss) mailing group. Our engineering team tries to answer as many of these questions as possible, but we appreciate help from end users! 25 | 26 | ### ✨ Submit a Feature Request 27 | 28 | As members of the TensorFlow community, your recommendations and suggestions are highly valued, and we are honored to have them. Please submit all feature requests as an issue on GitHub: 29 | 30 | * **[Feature Request](https://github.com/tensorflow/tensorflow/issues/new?template=30-feature-request.md)** 31 | * **[TensorFlow Lite Op Request](https://github.com/tensorflow/tensorflow/issues/new?template=40-tflite-op-request.md)** 32 | 33 | 34 | ### 🤔 Send an Experience Report 35 | 36 | If you would like to submit general feedback about TensorFlow (and in particular, about TensorFlow 2.0), consider submitting a friction log! 37 | 38 | **Friction logs** are documents that describe the frustrations and delights of a product, focused around a specific use case (for example, creating an LSTM model for text classification). They're also intended to be brutally honest - feel free to vent or to praise! 😊 39 | 40 | A template and example of a TensorFlow friction log can be found [here](https://docs.google.com/document/d/1HVG3t-mgGZKU4iMeguTWGejbnQ54qUTXwdCFkA5xHG0/edit?usp=sharing). 41 | 42 | Once you have completed such a document, please email it to our [testing team](mailto:testing@tensorflow.org). 43 | 44 | --------------------------------- 45 | 46 | ## 🛠 How to Get Involved 47 | 48 | Between now and the preview launch for TensorFlow 2.0, we will be actively maintaining a discussion group for any questions, comments, suggestions, or issues that arise. **We will be holding a weekly stand-up for TF 2.0 testing via Hangouts** that will be announced through the TensorFlow Testing Discussion Group. 49 | 50 | _Please subscribe to [testing@tensorflow.org](http://groups.google.com/a/tensorflow.org/forum/#!forum/testing) to stay up-to-date._ 51 | 52 | ### Special Interest Groups (SIGs) 53 | 54 | TensorFlow's [Special Interest Groups (SIGs)](https://github.com/tensorflow/community/tree/master/sigs) support community collaboration on particular projects. Members of these groups work together to build and support specific parts of TensorFlow or TensorFlow-related projects. 55 | 56 | _To join the discussion on a specific topic, subscribe to one of our SIG mailing lists:_ 57 | 58 | * **[TensorBoard](https://groups.google.com/a/tensorflow.org/d/forum/sig-tensorboard)**: Plug-in development, discussion, and contribution to TensorFlow visualization tooling. 
59 | * **[Networking](https://groups.google.com/a/tensorflow.org/d/forum/networking)**: Adding network protocols other than gRPC. 60 | * **[I/O](https://groups.google.com/a/tensorflow.org/d/forum/io)**: Support for file systems and formats not available in core TensorFlow. 61 | * **[Add-ons](https://groups.google.com/a/tensorflow.org/d/forum/addons)**: Extensions to TensorFlow that conform to the stable API. 62 | * **[Build](https://groups.google.com/a/tensorflow.org/d/forum/build)**: Discussion on TensorFlow distribution and packaging. 63 | -------------------------------------------------------------------------------- /CODE_OF_CONDUCT.md: -------------------------------------------------------------------------------- 1 | # TensorFlow Code of Conduct 2 | 3 | In the interest of fostering an open and welcoming environment, we as contributors and maintainers pledge to making participation in our project and our community a harassment-free experience for everyone, regardless of age, body size, disability, ethnicity, gender identity and expression, level of experience, nationality, personal appearance, race, religion, or sexual identity and orientation. 4 | 5 | 6 | ## Our Standards 7 | 8 | Examples of behavior that contributes to creating a positive environment include: 9 | 10 | * Using welcoming and inclusive language 11 | * Being respectful of differing viewpoints and experiences 12 | * Gracefully accepting constructive criticism 13 | * Focusing on what is best for the community 14 | * Showing empathy towards other community members 15 | 16 | Examples of unacceptable behavior by participants include: 17 | 18 | * The use of sexualized language or imagery and unwelcome sexual attention or advances 19 | * Trolling, insulting/derogatory comments, and personal or political attacks 20 | * Public or private harassment 21 | * Publishing others' private information, such as a physical or electronic address, without explicit permission 22 | * Conduct which could reasonably be considered inappropriate for the forum in which it occurs. 23 | 24 | All TensorFlow forums and spaces are meant for professional interactions, and any behavior which could reasonably be considered inappropriate in a professional setting is unacceptable. 25 | 26 | 27 | ## Our Responsibilities 28 | 29 | Project maintainers are responsible for clarifying the standards of acceptable behavior and are expected to take appropriate and fair corrective action in response to any instances of unacceptable behavior. 30 | 31 | Project maintainers have the right and responsibility to remove, edit, or reject comments, commits, code, wiki edits, issues, and other contributions that are not aligned to this Code of Conduct, or to ban temporarily or permanently any contributor for other behaviors that they deem inappropriate, threatening, offensive, or harmful. 32 | 33 | 34 | ## Scope 35 | 36 | This Code of Conduct applies to all content on tensorflow.org, TensorFlow’s GitHub organization, or any other official TensorFlow web presence allowing for community interactions, as well as at all official TensorFlow events, whether offline or online. 37 | 38 | The Code of Conduct also applies within project spaces and in public spaces whenever an individual is representing TensorFlow or its community. Examples of representing a project or community include using an official project e-mail address, posting via an official social media account, or acting as an appointed or de facto representative at an online or offline event. 
39 |  40 |  41 | ## Conflict Resolution 42 | 43 | Conflicts in an open source project can take many forms, from someone having a bad day and using harsh and hurtful language in the issue queue, to more serious instances such as sexist/racist statements or threats of violence, and everything in between. 44 | 45 | If the behavior is threatening or harassing, or for other reasons requires immediate escalation, please see below. 46 | 47 | However, for the vast majority of issues, we aim to empower individuals to first resolve conflicts themselves, asking for help when needed, and only after that fails to escalate further. This approach gives people more control over the outcome of their dispute. 48 | 49 | If you are experiencing or witnessing conflict, we ask you to use the following escalation strategy to address the conflict: 50 | 51 | 1. Address the perceived conflict directly with those involved, preferably in a real-time medium. 52 | 2. If this fails, get a third party (e.g. a mutual friend, and/or someone with background on the issue, but not involved in the conflict) to intercede. 53 | 3. If you are still unable to resolve the conflict, and you believe it rises to harassment or another code of conduct violation, report it. 54 | 55 | 56 | ## Reporting Violations 57 | 58 | Violations of the Code of Conduct can be reported to TensorFlow’s Project Stewards, Edd Wilder-James (ewj@google.com) and Sarah Novotny (sarahnovotny@google.com). The Project Steward will determine whether the Code of Conduct was violated, and will issue an appropriate sanction, possibly including a written warning or expulsion from the project, project sponsored spaces, or project forums. We ask that you make a good-faith effort to resolve your conflict via the conflict resolution policy before submitting a report. 59 | 60 | Violations of the Code of Conduct can occur in any setting, even those unrelated to the project. We will only consider complaints about conduct that has occurred within one year of the report. 61 | 62 | 63 | ## Enforcement 64 | 65 | If the Project Stewards receive a report alleging a violation of the Code of Conduct, the Project Stewards will notify the accused of the report, and provide them an opportunity to discuss the report before a sanction is issued. The Project Stewards will do their utmost to keep the reporter anonymous. If the act is ongoing (such as someone engaging in harassment), or involves a threat to anyone's safety (e.g. threats of violence), the Project Stewards may issue sanctions without notice. 66 | 67 | 68 | ## Attribution 69 | 70 | This Code of Conduct is adapted from the Contributor Covenant, version 1.4, available at https://contributor-covenant.org/version/1/4, and includes some aspects of the Geek Feminism Code of Conduct and the Drupal Code of Conduct. 71 | -------------------------------------------------------------------------------- /governance/SIGS.md: -------------------------------------------------------------------------------- 1 | # TensorFlow Special Interest Groups (SIGs) 2 | 3 | ## What makes a good SIG? 4 | 5 | The ideal scope for a SIG addresses a well-defined domain, where the majority 6 | of participation comes from the community. Additionally, there should be 7 | sufficient evidence that there are community members willing to engage and 8 | contribute should the interest group be established. 9 | 10 | Not all SIGs will have the same level of energy, breadth of scope, or governance 11 | models, so we should expect some variability.
12 | 13 | ## Non-goals: What a SIG is not 14 | 15 | The intent of a SIG is to facilitate collaboration on shared work. A SIG is 16 | therefore: 17 | 18 | * **Not a support forum**: a mailing list and a SIG are not the same thing 19 | * **Not immediately required**: early on in a project's life, you may not know if you have shared work or collaborators 20 | * **Not free labor**: energy is required to grow and coordinate the work collaboratively. 21 | 22 | Our approach to SIG creation will be conservative: thanks to the ease of 23 | starting projects on GitHub, there are many avenues where collaboration can 24 | happen without the need for a SIG. 25 | 26 | ## SIG playbook 27 | 28 | ### Research and consultation 29 | 30 | Proposers of groups should gather evidence for approval, as specified below. 31 | Some possible avenues to consider are: 32 | 33 | * A well-defined problem or set of problems the group would solve 34 | * Consultation with community members who would benefit, assessing both the 35 | benefit and their willingness to commit 36 | * For existing projects, evidence from issues and PRs that contributors care 37 | about the topic 38 | * Potential goals for the group to achieve 39 | * Resource requirements of running the group 40 | 41 | Even if the need for a SIG seems self-evident, the research and consultation are 42 | still important to the success of the group. 43 | 44 | ### Creating the new group 45 | 46 | The new group should follow the process below for chartering. In particular, it 47 | must demonstrate: 48 | 49 | * A clear purpose and benefit to TensorFlow (either around a sub-project or 50 | application area) 51 | * Two or more contributors willing to act as group leads, existence of other 52 | contributors, and evidence of demand for the group 53 | * Resources it will initially require (usually, a mailing list and a regular VC 54 | call). 55 | 56 | Approval for the group will be given by a decision of the TF Community Team, 57 | defined as the maintainers of the _tensorflow/community_ project. The team 58 | will consult other stakeholders as necessary. 59 | 60 | Before entering the formal parts of the process, it is advisable to consult with 61 | the TensorFlow community team, *community-team@tensorflow.org*. It is highly 62 | likely that conversation and iteration will be required before the SIG request 63 | is ready. 64 | 65 | The formal request for the new group is done by submitting a charter as a PR to 66 | _tensorflow/community_, and including the request in the comments on the PR (see 67 | template below). On approval, the PR for the group will be merged and the 68 | required resources created. 69 | 70 | ### Template Request for New SIG 71 | 72 | This template will be available in the community repo: [SIG-request-template.md](SIG-request-template.md). 73 | 74 | ### Chartering 75 | 76 | Each group will be established with a charter, and be governed by the TensorFlow 77 | code of conduct. Archives of the group will be public. Membership may either be 78 | open to all without approval, or available on request, pending approval of the 79 | group administrator. 80 | 81 | The charter must nominate an administrator. As well as an administrator, the 82 | group must include at least one person as lead (these may be the same person), 83 | who will serve as point of contact for coordination as required with the TF 84 | community team. 85 | 86 | This charter will be posted initially to the group mailing list.
The _community_ 87 | repository in the TensorFlow GitHub organization will archive such documents and 88 | policies ([example from Kubernetes](https://github.com/kubernetes/community)). 89 | As any group evolves its practices and conventions, we expect it to document 90 | these within the relevant part of the community repository. 91 | 92 | ### Collaboration and inclusion 93 | 94 | While it is not mandated, the group should make use of scheduled conference 95 | calls or chat channels to conduct meetings. Any such 96 | meetings should be advertised on the mailing list, and notes posted to the 97 | mailing list afterwards. Regular meetings help drive accountability and progress 98 | in a SIG. 99 | 100 | TensorFlow community team members will proactively monitor the group and 101 | encourage discussion and action as appropriate. 102 | 103 | ### Launching 104 | 105 | Required activities: 106 | 107 | * Notifying TensorFlow general mailing lists (discuss@, developers ML) of the new group 108 | * Adding the SIG to the community pages on the TensorFlow website 109 | 110 | Optional activities: 111 | 112 | * Creating a blog post for the TensorFlow Medium.com blog community 113 | 114 | ### Health and termination of SIGs 115 | 116 | The TF community team will make a best effort to ensure the health of SIGs. From 117 | time to time it will request that the SIG lead provide a report of the SIG's work, 118 | which will be used to inform the broader TensorFlow community of the activity of 119 | the group. 120 | 121 | If a SIG no longer has a useful purpose or interested community, it may be 122 | archived and cease operation. The TF community team reserves the right to 123 | archive such inactive SIGs, in order to maintain the health of the project at 124 | large, though this is a less preferable outcome. A SIG may also opt to disband if 125 | it recognizes it has reached the end of its useful life. 126 | -------------------------------------------------------------------------------- /governance/TF-RFCs.md: -------------------------------------------------------------------------------- 1 | # TensorFlow Request for Comments (TF-RFC) 2 | 3 | The purpose of a TensorFlow RFC is to engage the TensorFlow community in 4 | development, by getting feedback from stakeholders and experts, and 5 | communicating design changes broadly. 6 | 7 | ## Who is involved? 8 | 9 | Any **community member** may help by providing feedback on whether the RFC will 10 | meet their needs. 11 | 12 | An **RFC author** is a community member (or group of members) who writes an RFC and is 13 | committed to championing it through the process. 14 | 15 | An **RFC sponsor** is any maintainer who sponsors the RFC and will shepherd it 16 | through the RFC review process. 17 | 18 | A **review committee** is a group of maintainers who have the responsibility of 19 | recommending the adoption of the RFC. 20 | 21 | ## What is a TensorFlow RFC? 22 | 23 | An RFC is a document that describes a requirement and the proposed changes that 24 | will solve it. Specifically, the RFC will: 25 | 26 | * be formatted according to the RFC template 27 | * be submitted as a pull request to the 28 | [community/rfcs](https://github.com/tensorflow/community/tree/master/rfcs) directory 29 | * be subject to discussion and a review meeting prior to acceptance 30 | 31 | ## RFC process 32 | 33 | Before submitting an RFC, it is a good idea to discuss your aims with project 34 | contributors and maintainers and get early feedback.
Use the developer mailing 35 | list for the project concerned (developers@tensorflow.org, or the list for the 36 | relevant SIG). After writing the RFC draft, get feedback from these 37 | experts before submitting it. 38 | 39 | 1. Recruit a sponsor from the maintainers of the project that your RFC concerns. 40 | 41 | Identify them in the RFC, before posting the PR in step 2. 42 | If no sponsor is found, you may still post the RFC, but if 43 | within a month of posting the PR there is still no sponsor, 44 | it will be closed. 45 | 46 | 2. Submit your RFC as a pull request to community/rfcs. 47 | 48 | Name your RFC file using the [template](https://github.com/tensorflow/community/blob/master/rfcs/yyyymmdd-rfc-template.md) `YYYYMMDD-descriptive-name.md`, where 49 | YYYYMMDD is the date of submission, and ‘descriptive-name’ relates to the 50 | title of your RFC. For instance, if your RFC is titled “Parallel Widgets API”, 51 | you might use the filename `20180531-parallel-widgets.md`. If you have images 52 | or other auxiliary files, create a directory of the form `YYYYMMDD-descriptive-name` 53 | in which to store those files. 54 | 55 | Include the header table and the contents of the **Objective** section 56 | in the comment of your pull request, using Markdown. For an example, 57 | please see [this example 58 | RFC](https://github.com/tensorflow/community/pull/5). Mention 59 | the GitHub handles of any co-authors, reviewers, and sponsors. 60 | 61 | At the top of the PR identify how long the comment period will be. This 62 | should be a minimum of two weeks from posting the PR. 63 | 64 | 3. Email the developer mailing list with a brief description, a link to the 65 | PR, and a request for review. Follow the example of previous mailings, 66 | as you can see in [this 67 | example](https://groups.google.com/a/tensorflow.org/forum/#!topic/developers/PIChGLLnpTE). 68 | 69 | 4. The sponsor will request a review committee meeting, no sooner than two weeks 70 | after the RFC PR is posted. If discussion is lively, wait until it has 71 | settled before going to review. The goal of the review meeting is to resolve 72 | minor issues; consensus should be reached on major issues beforehand. 73 | 74 | 5. The meeting may approve the RFC, reject it, or require changes before it 75 | can be considered again. Approved RFCs will be merged into community/rfcs, and 76 | rejected RFCs will have their PRs closed. 77 | 78 | 6. Implementations of a successful RFC should reference it in their 79 | documentation, and work with the sponsor to successfully land the code. 80 | 81 | While implementation code is not necessary to start the RFC process, its 82 | existence in full or in part may help the design discussion. 83 | 84 | If in any doubt about this process, feel free to ask on the 85 | developers mailing list or file an issue in tensorflow/community. 86 | 87 | ## Community members 88 | 89 | As the purpose of RFCs is to ensure the community is well represented and served 90 | by new changes to TensorFlow, it is the responsibility of community members to 91 | participate in reviewing RFCs where they have an interest in the outcome.
92 | 93 | Community members should: 94 | 95 | * provide feedback as soon as possible to allow adequate time for consideration 96 | * read RFCs thoroughly before providing feedback 97 | * be civil and constructive 98 | 99 | ## Review committees 100 | 101 | The constitution of a review committee may change according to the particular 102 | governance style and leadership of each project. For core TensorFlow, the 103 | committee will consist of contributors to the TensorFlow project, who have 104 | expertise in the domain area concerned. 105 | 106 | Review committees must: 107 | 108 | * ensure that substantive items of public feedback have been accounted for 109 | * add their meeting notes as comments to the PR 110 | * provide reasons for their decisions 111 | 112 | If a review committee requires changes before acceptance, it is the 113 | responsibility of the sponsor to ensure these are made and seek subsequent 114 | approval from the committee members. 115 | 116 | ## RFC sponsors 117 | 118 | A sponsor is a project maintainer responsible for ensuring the best possible 119 | outcome of the RFC process. In particular this includes: 120 | 121 | * advocating for the proposed design 122 | * guiding the RFC to adhere to existing design and style conventions 123 | * guiding the review committee to come to a productive consensus 124 | * if the RFC moves to implementation: 125 | * ensuring the proposed implementation adheres to the design 126 | * liaising with appropriate parties to successfully land the implementation 127 | 128 | ## Keeping the bar high 129 | 130 | While we encourage and celebrate every contributor, the bar for RFC acceptance 131 | should be kept intentionally high. A design may be rejected or need significant 132 | revision at any one of these stages: 133 | 134 | * initial design conversations on the relevant mailing list 135 | * failure to recruit a sponsor 136 | * critical objections during the feedback phase 137 | * failure to achieve consensus during the design review 138 | * concerns raised during implementation (e.g., inability to achieve backwards 139 | compatibility, concerns about maintenance appearing once a partial implementation 140 | is available) 141 | 142 | If this process is functioning well, RFCs are expected to fail in the earlier, 143 | rather than later, stages. 144 | 145 | An approved RFC is no guarantee of a commitment to implement, and acceptance of 146 | a proposed RFC implementation is still subject to the usual code review 147 | process. 148 | 149 | ## RFC Template 150 | 151 | Use the template [from 152 | GitHub](https://github.com/tensorflow/community/blob/master/rfcs/yyyymmdd-rfc-template.md), 153 | being sure to follow the naming conventions described above. 154 | -------------------------------------------------------------------------------- /CONTRIBUTING.md: -------------------------------------------------------------------------------- 1 | # Contributing guidelines 2 | 3 | ## How to become a contributor and submit your own code 4 | 5 | ### Contributor License Agreements 6 | 7 | We'd love to accept your patches! Before we can take them, we have to jump a couple of legal hurdles. 8 | 9 | Please fill out either the individual or corporate Contributor License Agreement (CLA). 10 | 11 | * If you are an individual writing original source code and you're sure you own the intellectual property, then you'll need to sign an [individual CLA](https://code.google.com/legal/individual-cla-v1.0.html).
12 | * If you work for a company that wants to allow you to contribute your work, then you'll need to sign a [corporate CLA](https://code.google.com/legal/corporate-cla-v1.0.html). 13 | 14 | Follow either of the two links above to access the appropriate CLA and instructions for how to sign and return it. Once we receive it, we'll be able to accept your pull requests. 15 | 16 | ***NOTE***: Only original source code from you and other people who have signed the CLA can be accepted into the main repository. 17 | 18 | ### Contributing code 19 | 20 | If you have improvements to TensorFlow, send us your pull requests! For those 21 | just getting started, GitHub has a [howto](https://help.github.com/articles/using-pull-requests/). 22 | 23 | TensorFlow team members will be assigned to review your pull requests. Once the pull requests are approved and pass continuous integration checks, we will merge the pull requests. 24 | For some pull requests, we will apply the patch for each pull request to our internal version control system first, and export the change out as a new commit later, at which point the original pull request will be closed. The commits in the pull request will be squashed into a single commit with the pull request creator as the author. These pull requests will be labeled as pending merge internally. 25 | 26 | If you want to contribute but you're not sure where to start, take a look at the 27 | [issues with the "contributions welcome" label](https://github.com/tensorflow/tensorflow/labels/stat%3Acontributions%20welcome). 28 | These are issues that we believe are particularly well suited for outside 29 | contributions, often because we probably won't get to them right now. If you 30 | decide to start on an issue, leave a comment so that other people know that 31 | you're working on it. If you want to help but would rather not work alone, use the issue 32 | comment thread to coordinate. 33 | 34 | ### Contribution guidelines and standards 35 | 36 | Before sending your pull request for 37 | [review](https://github.com/tensorflow/tensorflow/pulls), 38 | make sure your changes are consistent with the guidelines and follow the 39 | TensorFlow coding style. 40 | 41 | #### General guidelines and philosophy for contribution 42 | 43 | * Include unit tests when you contribute new features, as they help to 44 | a) prove that your code works correctly, and b) guard against future breaking 45 | changes to lower the maintenance cost. 46 | * Bug fixes also generally require unit tests, because the presence of bugs 47 | usually indicates insufficient test coverage. 48 | * Keep API compatibility in mind when you change code in core TensorFlow, 49 | e.g., code in [tensorflow/core](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/core) and [tensorflow/python](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/python). 50 | TensorFlow has reached version 1 and hence cannot make 51 | non-backward-compatible API changes without a major release. Reviewers of your 52 | pull request will comment on any API compatibility issues. 53 | * When you contribute a new feature to TensorFlow, the maintenance burden is (by 54 | default) transferred to the TensorFlow team. This means that the benefit of the 55 | contribution must be compared against the cost of maintaining the feature.
56 | * Full new features (e.g., a new op implementing a cutting-edge algorithm) 57 | typically will live in 58 | [tensorflow/contrib](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/contrib) 59 | to get some airtime before a decision is made regarding whether they are to be 60 | migrated to the core. 61 | 62 | #### License 63 | 64 | Include a license at the top of new files. 65 | 66 | * [C/C++ license example](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/framework/op.cc#L1) 67 | * [Python license example](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/ops/nn.py#L1) 68 | * [Java license example](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/java/src/main/java/org/tensorflow/Graph.java#L1) 69 | * [Go license example](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/go/operation.go#L1) 70 | * [Bash license example](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/ci_build/ci_sanity.sh#L2) 71 | * [HTML license example](https://github.com/tensorflow/tensorboard/blob/master/tensorboard/components/tf_backend/tf-backend.html#L2) 72 | * [JavaScript/TypeScript license example](https://github.com/tensorflow/tensorboard/blob/master/tensorboard/components/tf_backend/backend.ts#L1) 73 | 74 | Bazel BUILD files also need to include a license section, e.g., 75 | [BUILD example](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/BUILD#L61). 76 | 77 | #### C++ coding style 78 | 79 | Changes to TensorFlow C++ code should conform to the 80 | [Google C++ Style Guide](https://google.github.io/styleguide/cppguide.html). 81 | 82 | Use `clang-tidy` to check your C/C++ changes. To install clang-tidy on ubuntu:16.04, do: 83 | 84 | ```bash 85 | apt-get install -y clang-tidy 86 | ``` 87 | 88 | You can check a C/C++ file by doing: 89 | 90 | 91 | ```bash 92 | clang-format --style=google <my_cc_file> > /tmp/my_cc_file.cc 93 | diff <my_cc_file> /tmp/my_cc_file.cc 94 | ``` 95 | 96 | #### Python coding style 97 | 98 | Changes to TensorFlow Python code should conform to the 99 | [Google Python Style Guide](https://google.github.io/styleguide/pyguide.html). 100 | 101 | Use `pylint` to check your Python changes. To install `pylint` and 102 | retrieve TensorFlow's custom style definition: 103 | 104 | ```bash 105 | pip install pylint 106 | wget -O /tmp/pylintrc https://raw.githubusercontent.com/tensorflow/tensorflow/master/tensorflow/tools/ci_build/pylintrc 107 | ``` 108 | 109 | To check a file with `pylint`: 110 | 111 | ```bash 112 | pylint --rcfile=/tmp/pylintrc myfile.py 113 | ``` 114 | 115 | #### Coding style for other languages 116 | 117 | * [Google Java Style Guide](https://google.github.io/styleguide/javaguide.html) 118 | * [Google JavaScript Style Guide](https://google.github.io/styleguide/jsguide.html) 119 | * [Google Shell Style Guide](https://google.github.io/styleguide/shell.xml) 120 | * [Google Objective-C Style Guide](https://google.github.io/styleguide/objcguide.html) 121 | 122 | #### Running sanity check 123 | 124 | If you have Docker installed on your system, you can perform a sanity check on 125 | your changes by running the command: 126 | 127 | ```bash 128 | tensorflow/tools/ci_build/ci_build.sh CPU tensorflow/tools/ci_build/ci_sanity.sh 129 | ``` 130 | 131 | This will catch most license, Python coding style, and BUILD file issues that 132 | may exist in your changes. 133 | 134 | #### Running unit tests 135 | 136 | There are two ways to run TensorFlow unit tests. 137 | 138 | 1.
Using tools and libraries installed directly on your system. 139 | 140 | Refer to the 141 | [CPU-only developer Dockerfile](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/docker/Dockerfile.devel) and 142 | [GPU developer Dockerfile](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/tools/docker/Dockerfile.devel-gpu) 143 | for the required packages. Alternatively, use the aforementioned 144 | [Docker images](https://hub.docker.com/r/tensorflow/tensorflow/tags/), e.g., 145 | `tensorflow/tensorflow:nightly-devel` and `tensorflow/tensorflow:nightly-devel-gpu` 146 | for development to avoid installing the packages directly on your system. 147 | 148 | Once you have the packages installed, you can run a specific unit test with 149 | bazel as follows. 150 | 151 | If the tests are to be run on GPU, add CUDA paths to LD_LIBRARY_PATH and add 152 | the `cuda` option flag: 153 | 154 | ```bash 155 | export LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:/usr/local/cuda/lib64:/usr/local/cuda/extras/CUPTI/lib64" 156 | 157 | export flags="--config=opt --config=cuda -k" 158 | ``` 159 | 160 | For example, to run all tests under tensorflow/python, do: 161 | 162 | ```bash 163 | bazel test ${flags} //tensorflow/python/... 164 | ``` 165 | 166 | 2. Using [Docker](https://www.docker.com) and TensorFlow's CI scripts. 167 | 168 | ```bash 169 | # Install Docker first, then this will build and run cpu tests 170 | tensorflow/tools/ci_build/ci_build.sh CPU bazel test //tensorflow/... 171 | ``` 172 | 173 | See 174 | [TensorFlow Builds](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/tools/ci_build) for details. 175 | 176 | -------------------------------------------------------------------------------- /rfcs/20180604-dynamic-kernels.md: -------------------------------------------------------------------------------- 1 | # Dynamic Loading of Kernels in TensorFlow 2 | | Status | Accepted | 3 | :-------------- | :-------------------------------------------------| 4 | | **Author(s)** | Gunhan Gulsoy (Google) | 5 | | **Sponsor** | Martin Wicke (Google) | 6 | | **Updated** | 2018-06-04 | 7 | 8 | ## Objective 9 | This document describes a new way to create and deploy new kernels for 10 | TensorFlow. We propose deploying kernels in separate shared libraries (dso, 11 | dylib or dll) and loading these at runtime. While at the moment the scope of 12 | this document only covers the **TensorFlow Python distribution**, we aim to 13 | generalize this approach for all TF distributions. With this mechanism, we 14 | would like to create the following capabilities: 15 | * Loading kernels dynamically at runtime from shared libraries. 16 | * Being able to load multiple kernels for the same op/device pair, and pick the 17 | best one in terms of hardware compatibility and performance. 18 | * Checking the hardware and loading the compatible kernels. 19 | * Checking the compiler options used and loading the compatible kernels. 20 | 21 | ## Overview 22 | For an Op, we need three pieces: 23 | * Python bindings, to make them accessible in the Python API 24 | * C++ op implementation 25 | * C++ Kernel implementation(s) 26 | 27 | This document proposes a new way to deploy and load **kernels**. 28 | 29 | In the current mechanism, the only constraint is that Python bindings have to be 30 | executed/loaded after the C++ op implementation is loaded. Kernels can be loaded at 31 | any time. This makes our task easier. When a kernel is loaded, it registers 32 | itself in the global registry with a string key.
The string key is constructed 33 | as follows: `op_name:device_name:(optional)label` 34 | 35 | To start this project off, what we propose is the following: 36 | * Create a new API, `tf.load_kernel_library` 37 | * Use the new API to load kernels from a different shared object. 38 | 39 | Then, we will start to build checks to be more picky about the kernels we load: 40 | * Build handling for loading multiple kernels for the same op and device pair. 41 | * Enhance the global kernel registry to allow cleanup of registered kernels when a 42 | library is unloaded. 43 | * Build the library compatibility checking mechanism, and unload libraries when 44 | they are found to be incompatible. 45 | 46 | Finally, we will add the following advanced checks: 47 | * Keep track of which libraries provide which kernels 48 | * Garbage collection of unqualified kernels, and their libraries. 49 | 50 | ## Detailed Current State 51 | While this document proposes a new way to **load kernels**, there are a lot of 52 | ideas we would like to adopt from the way ops are loaded. Therefore, the current 53 | op loading mechanism is also described in this section. 54 | 55 | ### Op loading 56 | Currently, we can load op libraries from shared objects. When loading custom or 57 | contrib ops, we also load their kernels. The following pseudocode describes how 58 | the current custom/contrib op loading mechanism works: 59 | * Custom contrib op Python bindings are not loaded until they are accessed. 60 | * At the first access, the `__init__` file of the custom op module calls `tf.load_op_library` 61 | * `load_op_library` loads the shared object using `TF_LoadLibrary` in the C API 62 | * Once the shared object is loaded, `load_op_library` now executes and loads the rest of the Python code in the op library. 63 | 64 | Now, diving deep into `TF_LoadLibrary`: 65 | * `TF_LoadLibrary` is called. This is just a thin wrapper and status checker around `tensorflow::LoadLibrary` 66 | * `tensorflow::LoadLibrary` checks first if this shared object is already loaded 67 | * In a serial way, making sure only one library is processed at a time: 68 | * It starts a watcher for `OpRegistry`, to get a list of ops included in the library 69 | * Try loading the library using `Environment::LoadLibrary` 70 | * Which just calls `tensorflow::internal::LoadLibrary` 71 | * Which is essentially just `dlopen`. 72 | 73 | ### Kernel loading 74 | Currently, the kernel loading mechanism is simpler than the op loading mechanism, at least at loading time. The mechanism can be summarized as follows: 75 | * Kernels use the `REGISTER_KERNEL_BUILDER` macro to create a static initializer 76 | * The static initializer is just an object of type `OpKernelRegistrar` 77 | * Which calls `OpKernelRegistrar::InitInternal` 78 | * Which saves the kernel in the `GlobalKernelRegistry`, with a factory method. 79 | * The kernel is read from the registry and instantiated when the op is executed. 80 | 81 | ## Design 82 | Here we will describe the details of the work we plan to perform. The work will be divided into four milestones: 83 | 84 | ### Milestone 1: Load kernels from shared objects 85 | This phase will just be a simple proof of concept, to show that loading kernels 86 | from shared objects will work. The deliverables of this phase are: 87 | 1. The `tf.load_kernel_library` API. This new method on our API will be responsible 88 | for loading kernels from given shared objects, or folders containing shared 89 | objects.
It will: 90 | * Load the given shared object, if it is an `.so` file 91 | * If a folder is given, load all `libtfkernel-*` shared object files in the folder 92 | 2. Split one or more kernels into a different shared object. This will involve: 93 | * Resolving the `BUILD` dependency mess to be able to create a reasonably small 94 | shared object for a kernel (size will be optimized later). 95 | * Resolving all symbol collisions stemming from the different shared objects, 96 | with both potentially depending on the core TF framework. 97 | * Finally, on the Python side of the op whose kernel is being split out, adding 98 | the directive: `tf.load_kernel_library("libtfkernel_kernel_name.so")` 99 | 3. Get a Bazel test to pass with a split kernel library 100 | 4. Get a working Python wheel file with a split kernel library, and run the 101 | kernel from the shared object. 102 | To simplify the proof of concept, at this stage we will only do this on Linux. 103 | 104 | ### Milestone 2: Enable kernel compatibility checks 105 | Once the proof of concept is ready, we need to start building the fancier 106 | features of the proposal. These will be: 107 | 1. Create a mechanism to save the compiler options from the Bazel side, and make 108 | them available to read in the C++ runtime. 109 | 2. Create a mechanism in addition to `KernelDef` to be stored in the 110 | `GlobalKernelRegistry` to help decide which kernels should be loaded. The 111 | following is the data structure we propose for this information: 112 | ```c 113 | /* Shared-library dependency (name and version). */ 114 | typedef struct TF_DsoDef { 115 | const char* name; 116 | const char* version; 117 | } TF_DsoDef; 118 | 119 | /* Hardware the kernel was built for. */ 120 | typedef struct TF_HardwareDef { 121 | const char** SIMD_ISA; // Or enum 122 | int SIMD_ISA_length; 123 | char* cpu_arch; 124 | const char** accelerator; 125 | int accelerator_length; 126 | } TF_HardwareDef; 127 | 128 | /* Compiler and options the kernel was built with. */ 129 | typedef struct TF_CompilerDef { 130 | const char* compiler; 131 | const char* compiler_version; 132 | const char** compiler_options; 133 | int compiler_options_length; 134 | int memory_alignment; 135 | } TF_CompilerDef; 136 | 137 | /* Source revision the kernel was built from. */ 138 | typedef struct TF_SourceDef { 139 | const char* git_hash; 140 | } TF_SourceDef; 141 | 142 | typedef struct TF_KernelBuildInfo { 143 | TF_DsoDef* dependencies; 144 | int dependencies_list_size; 145 | 146 | TF_SourceDef source_version; 147 | TF_HardwareDef hardware_def; 148 | TF_CompilerDef compiler_def; 149 | } TF_KernelBuildInfo; 150 | ``` 151 | 3. Create methods to extract all the above information from the core runtime, 152 | to check for compatibility with any given kernel library. 153 | 4. During kernel registration, implement checks for the following: 154 | * Is this kernel compatible with the given hardware 155 | * Is this kernel compatible with the software available on the system 156 | * Is this kernel ABI compatible with the core runtime 157 | * Is this kernel faster than any other kernels that are loaded. In this context, faster means one of the following: 158 | * Better optimized for the hardware 159 | * Uses a special acceleration library such as MKL 160 | 5. Provide a means to override some of the above checks for loading experimental kernels 161 | 6. Expand the global kernel registry to be functionally similar to the op registry. The op registry can unregister ops if there are any problems during object loading; the kernel registry should be able to do the same. 162 | 163 | ### Milestone 3: Make it work on different OSs 164 | While the above will be done on Linux, we will have to get things to work on all operating systems we support. For macOS, the issues are mainly around Bazel bugs.
For Windows, we will have to be more careful about symbol collisions, and a partial lockdown of symbol exports may be required to get things working. 165 | 166 | ### Milestone 4: Memory and performance optimizations 167 | When we load multiple shared objects, we can easily have some bloat in memory 168 | usage, or performance hits. The simplest things we can foresee are: 169 | 1. Multiple kernel registry entries that are retained when multiple kernels for 170 | the same op and device pair are loaded. 171 | 2. Some shared objects may only include slow kernels, and they may just be 172 | included in the distribution for compatibility. We can unload shared objects 173 | from memory if none of the kernels in them are useful. 174 | 3. Minimize the total size of the shared libraries created. Currently, the TF 175 | framework is one big monolithic build rule that everyone ends up depending on. 176 | Try to slim down the kernels, and get them to a size that makes sense to be 177 | included in TF Lite packages. 178 | 4. Make sure there are only kernels in the given shared object. Error out if 179 | someone sneaks ops into kernel libraries. 180 | 181 | ## Alternatives considered 182 | A number of alternatives have been considered before deciding on this route: 183 | 1. Create and distribute the whole package with different compiler options. 184 | While this is the path of least resistance, the monolithic package that needs 185 | to be tested fully on different hardware and compiler options is becoming 186 | unmanageable. The simplest example: we have a lot of code that needs to be 187 | tested with GPU compilers only once, but we end up having to run similar tests 188 | with 5+ different compiler options. Such issues drive up our testing costs in 189 | terms of both resources and developer time. 190 | 2. Splitting kernels into different binaries rather than different shared 191 | objects. While this will protect us from symbol collisions, ODR violations, or 192 | other classical headaches that plague shared objects, this will make things 193 | slower. Also, we would need to implement shared memory pages to share data 194 | across different processes, which will incur a similar engineering cost to the 195 | proposed approach. Therefore, we decided on using shared libraries instead. 196 | -------------------------------------------------------------------------------- /rfcs/20180507-cond-v2.md: -------------------------------------------------------------------------------- 1 | # **"Functional"** **cond design doc** 2 | 3 | | Status | Approved | 4 | :-------------- |:---------------------------------------------------- | 5 | | **Author(s)** | Skye Wanderman-Milne (skyewm@gmail.com) | 6 | | **Created** | 2018-05-07 | 7 | | **Updated** | 2018-08-22 | 8 | 9 | ## Objective 10 | 11 | **Switch tf.cond to emit a single If op.** 12 | 13 | We can do tf.while_loop next. 14 | 15 | This would make mapping to XLA's control flow constructs easier/possible. In particular, just switching to the If op would be a big win (more work is needed to get cond working with XLA than for while_loop, which has already had a lot of work done), and easier than while_loop. It will also make debugging and analysis of cond constructs much simpler, e.g. to implement higher-order derivatives. 16 | 17 | Note that cond will still support side-effecting ops (e.g. variable updates).
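For reference, here is a minimal sketch of the user-facing API, which this proposal leaves unchanged (the values are illustrative only). Today this call emits a `Switch`/`Merge` subgraph; under this proposal it would emit a single `If` op:

```python
import tensorflow as tf

x = tf.constant(2.0)
y = tf.constant(5.0)

# Both branches are argument-free callables that close over their inputs
# and must return the same number and type of tensors.
z = tf.cond(tf.less(x, y),
            lambda: tf.multiply(x, 17.0),  # taken when x < y
            lambda: tf.add(y, 23.0))       # taken otherwise
```

The user-facing signature stays the same; only the emitted graph representation differs.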
18 |  19 |  20 | ## Background material 21 | 22 | tf.cond API: https://www.tensorflow.org/api_docs/python/tf/cond 23 | 24 | If op: https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/ops/functional_ops.cc#L104 25 | 26 | Overview of current control flow implementation: [Implementation of Control Flow in TensorFlow](http://download.tensorflow.org/paper/white_paper_tf_control_flow_implementation_2017_11_1.pdf) 27 | 28 | 29 | ## Design overview 30 | 31 | 32 | ### Functional tf.cond 33 | 34 | The signature of `tf.cond` will stay the same: a boolean predicate Tensor and Python callables for the two branches. The two callables each take no arguments (they instead close over any input tensors), and are required to return the same number and type of tensors. 35 | 36 | We need to convert this to the If op signature, which is a boolean predicate and FunctionDefs for the two branches. The FunctionDefs are required to have the same number and type of inputs and outputs. Luckily, tfe.defun already gives us the machinery to convert the Python callables into FunctionDefs, including converting closures to inputs and adding extra inputs to make the branch signatures match. This is done via an overloaded Graph subclass, [FuncGraph](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/eager/function.py#L191), which gives us the full flexibility of graphs while creating the branch functions. 37 | 38 | This conversion results in a single If op representing the `tf.cond`. 39 | 40 | 41 | ### Gradients 42 | 43 | The gradient of an If op is another If op. The predicate is the same as the forward op's, and each branch function is the gradient function of the corresponding forward branch. 44 | 45 | This requires the gradient branch functions to access intermediate tensors of the forward branch functions. Internal tensors in a function can't be directly accessed, so we need to add the necessary intermediates as outputs to the forward If op (how to do this is discussed in the "Implementation challenges" section). 46 | 47 | 48 | ### Execution 49 | 50 | There are two choices for running the resulting If ops: 51 | 52 | 53 | 54 | 1. Use the `IfOp` kernel as-is, which runs the functions using `FunctionLibraryRuntime`. 55 | 1. "Lower" the If ops to the current `tf.cond` implementation (i.e. `Switch` and `Merge` nodes). 56 | 57 | (1) is simpler at a high level, but (2) will avoid some of the implementation challenges below. 58 | 59 | The lowering can be implemented as an early (pre-placement) optimization pass, in order for the lowered control flow to be placed, pruned, partitioned, etc. as usual. There are already a few examples of similar passes: ParallelConcatRemovePass and AccumulateNV2RemovePass. 60 | 61 | **Update**: this is done: [LowerIfOpPass](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/common_runtime/lower_if_op.h) 62 | 63 | We don't want to lower If ops that will eventually be consumed by the [XLA encapsulation pass](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/compiler/jit/jit_compilation_pass_registration.cc#L35), so the TF-XLA bridge can take advantage of the easy-to-convert functional representation. This can be achieved by setting an attribute on the If op indicating whether it should be lowered, determined by, e.g., whether the If op is in an `XLAContext`. This may prove useful for other future use cases as well, such as transitioning to using the functional representation in the main TF runtime.
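To make the gradient construction described above concrete, here is a minimal sketch (the variable names are illustrative). Under this proposal, the backward pass for `y` would itself be a single `If` op with the same predicate, whose branches are the gradient functions of the forward branches:

```python
import tensorflow as tf

x = tf.placeholder(tf.float32)

y = tf.cond(tf.less(x, 0.0),
            lambda: tf.square(x),  # dy/dx = 2x on this branch
            lambda: tf.exp(x))     # dy/dx = exp(x) on this branch

# The backward pass is itself a conditional with the same predicate; its
# branches need intermediates of the taken forward branch (here x and exp(x)),
# which motivates the first implementation challenge below.
grad = tf.gradients(y, x)
```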
64 |  65 |  66 | ## Implementation challenges 67 | 68 | 69 | ### Exposing intermediate tensors to gradient functions 70 | 71 | See the "Gradients" section. We somehow need to add intermediate tensors as outputs to the already-created forward-pass If op and its branch functions. Options: 72 | 73 | 74 | 75 | 1. Create a new If op with the required outputs. To prevent running both the original and new ops, we need to rewire the outputs of the original op to use the new op (and ideally modify any existing Tensor objects as well). 76 | 1. Modify the existing If op in-place. This involves either modifying or replacing the branch functions, and changing the outputs of the op (tricky, but probably doable). 77 | 78 | Note that both of these options require mutating existing graph elements. If the graph has already been run, **this will invalidate any existing Sessions!** Other options: 79 | 80 | 81 | 82 | 1. Use placeholders for intermediates during construction, then use a C++ rewrite (Grappler or GraphOptimizationPass) to rewire the graph. 83 | 1. Output every possible intermediate. 84 | 1. It might already work as-is. 85 | 1. Except for ExtendGraph -- a solution could be to make the C API and Session share the same Graph* 86 | 87 | **Update**: we went with (2), outputting every possible intermediate 88 | 89 | 90 | ### Making branch function outputs match 91 | 92 | After adding the intermediate outputs to the forward If op's branch functions, it's likely the two functions don't have the same output signature anymore. For each new output of each branch, we need to add an extra output tensor to the other branch to mirror it (since the If op requires that the two output signatures match). 93 | 94 | Note that the "mirror" tensors never need to be read. The original output is only consumed by the corresponding gradient function, which is only executed if the original output's branch is taken. Thus, if the mirror tensor is produced, no consumer of it will be run. However, without pruning and/or non-strict execution, the If op must still produce some value for the mirror tensor. 95 | 96 | _Solution:_ 97 | 98 | Introduce a special op to output mirror tensors. This op's shape inference function will claim to output the same shape and type as the mirrored output, but since the tensor isn't actually needed the kernel will produce some small value to avoid producing large unnecessary values. If/when the op doesn't need to produce a value (e.g. via lowering + pruning), the kernel can CHECK or similar. 99 | 100 | 101 | ### Taking the gradient of deserialized If ops 102 | 103 | We need a graph representing the branch function of an If op in order to take its gradient. We already have a graph as part of creating the function, but if the graph was loaded from a GraphDef, we no longer have this graph. Options: 104 | 105 | 106 | 107 | 1. FunctionDef → Graph method 108 | 109 | 110 | ### Variable initialization 111 | 112 | Variables created in the `cond` input callables must be created in the main graph, not in the temporary `FuncGraphs`. Luckily this is already handled by [init_scope](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/framework/ops.py#L5230), which should already be used as necessary to handle creating variables in Defuns, etc. 113 | 114 | 115 | ### Collections 116 | 117 | We must support reading and writing to collections in the `cond` input callables.
118 | 119 | Reading from collections in eager-mode defuns already works by copying the collections into the `FuncGraphs`, which should presumably work here as well. 120 | 121 | For writing, we'll have to forward or copy the values back to the original collections. This is tricky and poorly-defined for Tensor and Operation values, and possibly intractable for data structures containing graph elements (e.g. `WhileContext`). Options: 126 | 127 | 128 | 129 | 1. Collections are supposed to go away in TF 2.0 130 | 1. Somehow turn Tensors into function outputs 131 | 1. Can some tensors/operations be pulled out of the function? 132 | 1. Expose "legacy cond" in contrib, eventually deprecate. 133 | 134 | **Writing to collections requires more investigation.** 135 | 136 | For example, how are people using collections within `cond` branches? How do they avoid dead Tensors? 137 | 138 | 139 | ### Name/device/colocation scope 140 | 141 | Similar to reading collections, any graph-wide stacks and other state can be copied into the `FuncGraphs`. New scopes can then be added within the FuncGraph, and the semantics prevent any added state from persisting beyond the input callable. 144 | 145 | For colocation, we can possibly use external tensor names as-is, since they'll either be lowered into the main graph or compiled by XLA. 147 | 148 | 149 | ### Control dependencies 150 | 151 | If the `tf.cond` call occurs inside a control_dependencies block, the control inputs will be added directly to the resulting If op. 153 | 154 | If the `cond` input callables contain control_dependencies blocks referring to external tensors, we can create Identity nodes of the external tensors inside the function definition, and then create internal control edges (functions only have data inputs). 157 | 158 | _The following concerns are avoided by lowering If ops before execution (see "Execution" section):_ 160 | 161 | 162 | ### Devices 163 | 164 | Akshay is working on allowing functions to run across multiple devices. My understanding is that it's mostly working, with a few limitations (e.g. all arguments to the function must go through the caller device, colocation with external tensors doesn't work). 168 | 169 | 170 | ### Partial evaluation 171 | 172 | TF graphs are pruned before execution, meaning only the subgraph needed to compute the requested output tensors is run (this doesn't work completely for ops in a conditional branch, but some pruning still occurs). This is not currently possible with TF functions; the entire function is run regardless of which outputs are needed. This would need to be supported for parity with the current `cond` implementation. 179 | 180 | 181 | ### Non-strict execution 182 | 183 | The current `cond` implementation allows each op in the taken branch to be run as soon as its inputs are ready, even if other ops in the branch aren't ready yet ("non-strict" execution). However, each TF op kernel will only begin running once its inputs are all ready ("strict" execution), with `Merge` nodes being the only exception. If we replace the current `cond` construct with a single op, this will switch `cond` to strict execution. We would need to support non-strict execution of If ops and their branch functions. 190 | 191 | 192 | ## Future work 193 | 194 | **tf.while_loop**. This effort will solve most of the problems with switching to a functional While representation (or a recursive function representation?). The remaining challenges are inserting stacks for the gradients, and supporting parallel iterations.
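As a point of reference for the scale of that work, here is a minimal `tf.while_loop` (the values are illustrative only). A functional `While` representation would carry `cond` and `body` as functions, and its gradient would need stacks to preserve each iteration's intermediates:

```python
import tensorflow as tf

# cond and body are callables over the loop variables, mirroring the
# branch callables of tf.cond; a functional While op would carry them
# as FunctionDefs.
_, acc = tf.while_loop(
    cond=lambda i, acc: tf.less(i, 10),
    body=lambda i, acc: (i + 1, acc * 2.0),
    loop_vars=(tf.constant(0), tf.constant(1.0)))
```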
169 | 170 | **C API support.** Ideally other language bindings support conditional execution as well. The C API already includes the primitives for other bindings to implement something similar to `tf.cond` that produces an `If` op, but the C API `TF_AddGradients` method would need to support `If` ops in order for other bindings to (easily) allow autodiff of conditionals. 171 | -------------------------------------------------------------------------------- /LICENSE: -------------------------------------------------------------------------------- 1 | Copyright 2018 The TensorFlow Authors. All rights reserved. 2 | 3 | Apache License 4 | Version 2.0, January 2004 5 | http://www.apache.org/licenses/ 6 | 7 | TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION 8 | 9 | 1. Definitions. 10 | 11 | "License" shall mean the terms and conditions for use, reproduction, 12 | and distribution as defined by Sections 1 through 9 of this document. 13 | 14 | "Licensor" shall mean the copyright owner or entity authorized by 15 | the copyright owner that is granting the License. 16 | 17 | "Legal Entity" shall mean the union of the acting entity and all 18 | other entities that control, are controlled by, or are under common 19 | control with that entity. For the purposes of this definition, 20 | "control" means (i) the power, direct or indirect, to cause the 21 | direction or management of such entity, whether by contract or 22 | otherwise, or (ii) ownership of fifty percent (50%) or more of the 23 | outstanding shares, or (iii) beneficial ownership of such entity. 24 | 25 | "You" (or "Your") shall mean an individual or Legal Entity 26 | exercising permissions granted by this License. 27 | 28 | "Source" form shall mean the preferred form for making modifications, 29 | including but not limited to software source code, documentation 30 | source, and configuration files. 31 | 32 | "Object" form shall mean any form resulting from mechanical 33 | transformation or translation of a Source form, including but 34 | not limited to compiled object code, generated documentation, 35 | and conversions to other media types. 36 | 37 | "Work" shall mean the work of authorship, whether in Source or 38 | Object form, made available under the License, as indicated by a 39 | copyright notice that is included in or attached to the work 40 | (an example is provided in the Appendix below). 41 | 42 | "Derivative Works" shall mean any work, whether in Source or Object 43 | form, that is based on (or derived from) the Work and for which the 44 | editorial revisions, annotations, elaborations, or other modifications 45 | represent, as a whole, an original work of authorship. For the purposes 46 | of this License, Derivative Works shall not include works that remain 47 | separable from, or merely link (or bind by name) to the interfaces of, 48 | the Work and Derivative Works thereof. 49 | 50 | "Contribution" shall mean any work of authorship, including 51 | the original version of the Work and any modifications or additions 52 | to that Work or Derivative Works thereof, that is intentionally 53 | submitted to Licensor for inclusion in the Work by the copyright owner 54 | or by an individual or Legal Entity authorized to submit on behalf of 55 | the copyright owner. 
For the purposes of this definition, "submitted" 56 | means any form of electronic, verbal, or written communication sent 57 | to the Licensor or its representatives, including but not limited to 58 | communication on electronic mailing lists, source code control systems, 59 | and issue tracking systems that are managed by, or on behalf of, the 60 | Licensor for the purpose of discussing and improving the Work, but 61 | excluding communication that is conspicuously marked or otherwise 62 | designated in writing by the copyright owner as "Not a Contribution." 63 | 64 | "Contributor" shall mean Licensor and any individual or Legal Entity 65 | on behalf of whom a Contribution has been received by Licensor and 66 | subsequently incorporated within the Work. 67 | 68 | 2. Grant of Copyright License. Subject to the terms and conditions of 69 | this License, each Contributor hereby grants to You a perpetual, 70 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 71 | copyright license to reproduce, prepare Derivative Works of, 72 | publicly display, publicly perform, sublicense, and distribute the 73 | Work and such Derivative Works in Source or Object form. 74 | 75 | 3. Grant of Patent License. Subject to the terms and conditions of 76 | this License, each Contributor hereby grants to You a perpetual, 77 | worldwide, non-exclusive, no-charge, royalty-free, irrevocable 78 | (except as stated in this section) patent license to make, have made, 79 | use, offer to sell, sell, import, and otherwise transfer the Work, 80 | where such license applies only to those patent claims licensable 81 | by such Contributor that are necessarily infringed by their 82 | Contribution(s) alone or by combination of their Contribution(s) 83 | with the Work to which such Contribution(s) was submitted. If You 84 | institute patent litigation against any entity (including a 85 | cross-claim or counterclaim in a lawsuit) alleging that the Work 86 | or a Contribution incorporated within the Work constitutes direct 87 | or contributory patent infringement, then any patent licenses 88 | granted to You under this License for that Work shall terminate 89 | as of the date such litigation is filed. 90 | 91 | 4. Redistribution. 
You may reproduce and distribute copies of the 92 | Work or Derivative Works thereof in any medium, with or without 93 | modifications, and in Source or Object form, provided that You 94 | meet the following conditions: 95 | 96 | (a) You must give any other recipients of the Work or 97 | Derivative Works a copy of this License; and 98 | 99 | (b) You must cause any modified files to carry prominent notices 100 | stating that You changed the files; and 101 | 102 | (c) You must retain, in the Source form of any Derivative Works 103 | that You distribute, all copyright, patent, trademark, and 104 | attribution notices from the Source form of the Work, 105 | excluding those notices that do not pertain to any part of 106 | the Derivative Works; and 107 | 108 | (d) If the Work includes a "NOTICE" text file as part of its 109 | distribution, then any Derivative Works that You distribute must 110 | include a readable copy of the attribution notices contained 111 | within such NOTICE file, excluding those notices that do not 112 | pertain to any part of the Derivative Works, in at least one 113 | of the following places: within a NOTICE text file distributed 114 | as part of the Derivative Works; within the Source form or 115 | documentation, if provided along with the Derivative Works; or, 116 | within a display generated by the Derivative Works, if and 117 | wherever such third-party notices normally appear. The contents 118 | of the NOTICE file are for informational purposes only and 119 | do not modify the License. You may add Your own attribution 120 | notices within Derivative Works that You distribute, alongside 121 | or as an addendum to the NOTICE text from the Work, provided 122 | that such additional attribution notices cannot be construed 123 | as modifying the License. 124 | 125 | You may add Your own copyright statement to Your modifications and 126 | may provide additional or different license terms and conditions 127 | for use, reproduction, or distribution of Your modifications, or 128 | for any such Derivative Works as a whole, provided Your use, 129 | reproduction, and distribution of the Work otherwise complies with 130 | the conditions stated in this License. 131 | 132 | 5. Submission of Contributions. Unless You explicitly state otherwise, 133 | any Contribution intentionally submitted for inclusion in the Work 134 | by You to the Licensor shall be under the terms and conditions of 135 | this License, without any additional terms or conditions. 136 | Notwithstanding the above, nothing herein shall supersede or modify 137 | the terms of any separate license agreement you may have executed 138 | with Licensor regarding such Contributions. 139 | 140 | 6. Trademarks. This License does not grant permission to use the trade 141 | names, trademarks, service marks, or product names of the Licensor, 142 | except as required for reasonable and customary use in describing the 143 | origin of the Work and reproducing the content of the NOTICE file. 144 | 145 | 7. Disclaimer of Warranty. Unless required by applicable law or 146 | agreed to in writing, Licensor provides the Work (and each 147 | Contributor provides its Contributions) on an "AS IS" BASIS, 148 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or 149 | implied, including, without limitation, any warranties or conditions 150 | of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A 151 | PARTICULAR PURPOSE. 
You are solely responsible for determining the 152 | appropriateness of using or redistributing the Work and assume any 153 | risks associated with Your exercise of permissions under this License. 154 | 155 | 8. Limitation of Liability. In no event and under no legal theory, 156 | whether in tort (including negligence), contract, or otherwise, 157 | unless required by applicable law (such as deliberate and grossly 158 | negligent acts) or agreed to in writing, shall any Contributor be 159 | liable to You for damages, including any direct, indirect, special, 160 | incidental, or consequential damages of any character arising as a 161 | result of this License or out of the use or inability to use the 162 | Work (including but not limited to damages for loss of goodwill, 163 | work stoppage, computer failure or malfunction, or any and all 164 | other commercial damages or losses), even if such Contributor 165 | has been advised of the possibility of such damages. 166 | 167 | 9. Accepting Warranty or Additional Liability. While redistributing 168 | the Work or Derivative Works thereof, You may choose to offer, 169 | and charge a fee for, acceptance of support, warranty, indemnity, 170 | or other liability obligations and/or rights consistent with this 171 | License. However, in accepting such obligations, You may act only 172 | on Your own behalf and on Your sole responsibility, not on behalf 173 | of any other Contributor, and only if You agree to indemnify, 174 | defend, and hold each Contributor harmless for any liability 175 | incurred by, or claims asserted against, such Contributor by reason 176 | of your accepting any such warranty or additional liability. 177 | 178 | END OF TERMS AND CONDITIONS 179 | 180 | APPENDIX: How to apply the Apache License to your work. 181 | 182 | To apply the Apache License to your work, attach the following 183 | boilerplate notice, with the fields enclosed by brackets "[]" 184 | replaced with your own identifying information. (Don't include 185 | the brackets!) The text should be enclosed in the appropriate 186 | comment syntax for the file format. We also recommend that a 187 | file or class name and description of purpose be included on the 188 | same "printed page" as the copyright notice for easier 189 | identification within third-party archives. 190 | 191 | Copyright 2017, The TensorFlow Authors. 192 | 193 | Licensed under the Apache License, Version 2.0 (the "License"); 194 | you may not use this file except in compliance with the License. 195 | You may obtain a copy of the License at 196 | 197 | http://www.apache.org/licenses/LICENSE-2.0 198 | 199 | Unless required by applicable law or agreed to in writing, software 200 | distributed under the License is distributed on an "AS IS" BASIS, 201 | WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 202 | See the License for the specific language governing permissions and 203 | limitations under the License. 
204 | 
--------------------------------------------------------------------------------
/rfcs/20180731-dockerfile-assembler.md:
--------------------------------------------------------------------------------
 1 | # TensorFlow Dockerfile Assembler
 2 | 
 3 | | Status        | Accepted                                              |
 4 | :-------------- |:---------------------------------------------------- |
 5 | | **Author(s)** | Austin Anderson (angerson@google.com)                 |
 6 | | **Sponsor**   | Gunhan Gulsoy (gunan@google.com)                      |
 7 | | **Updated**   | 2018-08-23                                            |
 8 | 
 9 | 
10 | # Summary
11 | 
12 | This document describes a new way to manage TensorFlow's dockerfiles. Instead
13 | of handling complexity via an on-demand build script, Dockerfile maintainers
14 | manage re-usable chunks called partials which are assembled into documented,
15 | standard, committed-to-repo Dockerfiles that don't need extra scripts to build.
16 | It is also decoupled from the system that builds and uploads the Docker images,
17 | which can be safely handled by separate CI scripts.
18 | 
19 | **Important:** This document is slim. The real meat of the design has already
20 | been implemented in [this PR to
21 | tensorflow/tensorflow](https://github.com/tensorflow/tensorflow/pull/21291).
22 | 
23 | **Also Important:** This design is not currently attempting to revise the images for speed or size: the design sets out a process that makes optimizing the images much easier to do on a larger scale.
24 | 
25 | # Background
26 | 
27 | TensorFlow's Docker offerings have lots of problems that affect both users and
28 | developers. [Our images](https://hub.docker.com/r/tensorflow/tensorflow/) are
29 | not particularly well defined or documented, and [our
30 | Dockerfiles](https://github.com/tensorflow/tensorflow/tree/master/tensorflow/tools/docker)
31 | are complicated and frightening.
32 | 
33 | ## Existing Images are Hard to Optimize
34 | 
35 | TensorFlow's current set of Dockerfiles are difficult to optimize. Developers
36 | dislike pulling enormous Docker containers, and many of our tags could be
37 | considered clunky (tag sizes yanked from Dockerhub, see also @flx42's comment
38 | on this doc's PR for on-disk sizes):
39 | 
40 | | Image Tag            | Size   |
41 | |:---------------------|-------:|
42 | | latest-devel-gpu-py3 | 1 GB   |
43 | | latest-devel-py3     | 773 MB |
44 | | latest-gpu-py3       | 1 GB   |
45 | | latest-py3           | 438 MB |
46 | | latest-devel-gpu     | 1 GB   |
47 | | latest-devel         | 727 MB |
48 | | latest-gpu           | 1 GB   |
49 | | latest               | 431 MB |
50 | 
51 | Including an extra dependency like Jupyter and convenience packages can add
52 | a few hundred megabytes of extra storage. Since some developers want to have
53 | Jupyter in the images and it's too much trouble for us to maintain many similar
54 | Dockerfiles, we've ended up with a limited set of non-optimized images. I'm not
55 | sure if this is truly a critical problem, but it's a little annoying (one of my
56 | personal computers only has 32 GB of SSD space on the root drive, and I
57 | regularly need to wipe my docker cache of large images).
58 | 
59 | ## TF Docker Images need Complexity
60 | 
61 | Our Docker images support two primary use cases: development _with_ TensorFlow,
62 | and development _on_ TensorFlow. We want a matrix of options available for both
63 | of these types of users, the most critical being GPU development support
64 | (currently nvidia-only) and pre-installed Jupyter support. With only those
65 | options considered, we target eight very similar Docker images; sixteen with
66 | Python versioning.
67 | 
68 | Our current images come from a script called `parameterized_docker_build.sh`,
69 | which live-edits a templated Dockerfile with `sed` to insert new Dockerfile
70 | commands. The script has a poor reputation because it can be finicky and is
71 | much harder to follow than vanilla Dockerfiles. Some
72 | Dockerfiles are duplicated, some are unused, and some users have made their own
73 | instead. None of the Dockerfiles use the ARG directive.
74 | 
75 | Furthermore, `parameterized_docker_build.sh` is tightly coupled with the
76 | deploy-to-image-hub process we use, which is confusing because users who build
77 | the images locally don't need that information at all.
78 | 
79 | This document proposes a new way for the TF team to maintain this complex set
80 | of similar Dockerfiles.
81 | 
82 | # Design
83 | 
84 | We use a generator to assemble multiple partial Dockerfiles into concrete
85 | Dockerfiles that get committed into source control. These Dockerfiles are fully
86 | documented and support argument customization. Unlike the parameterized image
87 | builder script, this system excludes the image deployment steps, which should
88 | be handled by a totally different system anyway.
89 | 
90 | This section lightly describes the design, which is fully implemented in [this
91 | pull request to the main TensorFlow
92 | repo](https://github.com/tensorflow/tensorflow/pull/21291).
93 | 
94 | Partial files ("partials") are syntactically valid but incomplete fragments of
95 | Dockerfile syntax.
96 | 
97 | Assembly is controlled by a specification file, defined in yaml. The spec
98 | defines the partials, the ARGs they use, the list of Dockerfiles to generate
99 | based on ordered lists of partials, and documentation for those values.
100 | 
101 | The assembler is a python script that accepts a spec and generates a bunch of
102 | Dockerfiles to be committed. The spec includes documentation and descriptions,
103 | and the output Dockerfiles are fully documented and can be built manually.
104 | 
105 | **Important**: This design in its current implementation does **not** attempt
106 | to address the limitations of our current set of images. Instead, it replicates
107 | the current set of tags with a few easy improvements, the most notable being a
108 | separate set of Dockerfiles that add Jupyter -- identical in every way to the
109 | non-Jupyter images without needing any extra maintenance. This design makes it
110 | much easier to craft TensorFlow's Docker offering in a way that satisfies
111 | everyone with minimal extra work from the Dockerfile maintainers.
112 | 
113 | # Impact
114 | 
115 | This approach has many convenient benefits:
116 | 
117 | * The result is concrete, buildable, documented Dockerfiles. Users who wish
118 | to build their own images locally do not need to also understand the build
119 | system. Furthermore, basing our images on clean Dockerfiles that live in the repository feels right -- as a user, I (personally) like to be able to see how an image works. It removes the mystery and magic from the process.
120 | * This implementation is agnostic to what images we would like to make
121 | available online (i.e. our Docker story). It's very easy to add new dockerfile
122 | outputs.
123 | * The build-test-and-deploy-images process is decoupled from the Dockerfile
124 | generation process.
125 | * Control of the set of dockerfiles is centralized to the spec file, instead
126 | of being spread across each Dockerfile.
127 | * The spec can be extended to add more conveniences. My implementation, for
128 | example, already includes de-duplication of many similar Dockerfile
129 | specifications.
130 | * All dockerfiles are consistently documented.
131 | * Common pieces of code, like a slick shell environment or a Jupyter
132 | interface, can be updated in batch by updating a single partial file.
133 | * The spec can also be used in the image building process, e.g. to read all
134 | available args.
135 | 
136 | # Caveats and Rejected Alternatives
137 | 
138 | I considered two alternatives while working on this.
139 | 
140 | ## Hacky Multi-Stage Dockerfile
141 | 
142 | "Multi-stage Building" is a powerful new Dockerfile feature that supports
143 | multiple FROM statements in one Dockerfile. Multi-stage builds let you build
144 | and run an artifact (like a compiled version of a binary) in any number of
145 | separate stages designated by FROM directives; the resulting image is only as
146 | large as the final stage, without the build-only dependencies from previous
147 | stages.
148 | 
149 | However, Docker's ARG parameter expansion can be used in these extra FROM
150 | directives to conditionally set base images for each build stage:
151 | 
152 | 
153 | ```dockerfile
154 | # If --build-arg FROM_FOO is set, build FROM foo; else build FROM bar.
155 | ARG FROM_FOO
156 | ARG _HELPER=${FROM_FOO:+foo}
157 | ARG BASE_IMAGE=${_HELPER:-bar}
158 | FROM ${BASE_IMAGE}
159 | …
160 | ```
161 | 
162 | This means that it's possible to use multi-stage builds and ARGs to create
163 | stages that are conditionally based on previous stages in the Dockerfile.
164 | [This sample
165 | Dockerfile](https://gist.github.com/angersson/3d2b5ae6a01de4064b1c3fe7a56e3821),
166 | which I've included only as a demonstration of a bad idea (and may or may not
167 | currently work), is very powerful but not extensible and not easy to understand. It is
168 | heavily coupled to our current environment, which may change immensely, e.g. if
169 | AMD releases Docker images similar to Nvidia's or if someone would like to add
170 | MKL support.
171 | 
172 | ## Multiple Normal Dockerfiles Aggregated into Multiple Stages
173 | 
174 | In a [comment on this doc's PR](https://github.com/tensorflow/community/pull/8#issuecomment-410080344), @flx42 suggested a much-improved version of the
175 | previous section. Another way of using ARG interpolation in FROM lines would be
176 | to write multiple isolated Dockerfiles that can be layered together during the `docker build` process:
177 | 
178 | 
179 | ```dockerfile
180 | ARG from
181 | FROM ${from}
182 | 
183 | ARG pip
184 | RUN ${pip} install jupyter
185 | ```
186 | 
187 | And then:
188 | 
189 | ```shell
190 | $ docker build -t nvidia-devel -f Dockerfile.nvidia-devel .
191 | $ docker build -t nvidia-devel-jupyter-py3 --build-arg from=nvidia-devel --build-arg pip=pip3 -f Dockerfile.jupyter .
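# (Hypothetical extension, not in the original comment: the same
# Dockerfile.jupyter layer could be stacked onto any other base image
# built the same way, e.g. a CPU one.)
$ docker build -t cpu-devel-jupyter --build-arg from=cpu-devel --build-arg pip=pip -f Dockerfile.jupyter .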
192 | ```
193 | 
194 | This shares the advantage of the current design by working from many reusable parts, but carries some notable tradeoffs:
195 | 
196 | ### Advantages over Current Design
197 | 
198 | I can see a variety of minor improvements:
199 | 
200 | - No need for assembler script or spec file
201 | - Possibly faster build times due to concretely isolated image stages
202 | - Image stages (akin to partials) may be more reusable due to slot-like usage of `--build-args`
203 | - Because there are no concrete Dockerfiles, there's only one place that defines the Dockerhub tags and what components describe them (in the current design, the spec file describes the Dockerfiles, and then a little more logic elsewhere in our CI would configure those Dockerfiles with the tags)
204 | 
205 | ### Downsides compared to Current Design
206 | 
207 | ...but some downsides that I think are fairly heavy:
208 | 
209 | - Spec + Assembler have some very nice advantages (validation, re-use, etc.)
210 | - No concrete Dockerfiles for OSS devs to use / refer to
211 | - Advanced usage requires some unintuitive file/directory layout + build ordering
212 | - Image-building complexity offloaded to OSS developers and to the CI scripts, which would need scripts / logic to define sets of images to build
213 | - Updating requires familiarity with multi-stage behavior
214 | 
215 | ### Conclusion
216 | 
217 | This is an interesting approach that I like a lot, but I don't think it offers
218 | enough benefits over the current design (which has another advantage in that it
219 | is already mostly finished) to implement.
220 | 
221 | It's worth noting that using multiple FROM stages is a powerful tool that could
222 | possibly be leveraged in the partials for the current design.
223 | 
224 | ## Manually Maintained Dockerfiles with Script References
225 | 
226 | Another pattern that supports complicated Dockerfiles is to manually maintain
227 | many Dockerfiles that each call out to a common set of build scripts:
228 | 
229 | ```dockerfile
230 | FROM ubuntu
231 | COPY install_scripts/ /bin
232 | RUN /bin/install_nvidia_dev.sh
233 | RUN /bin/install_python_dev.sh
234 | RUN /bin/install_bazel.sh
235 | ...
236 | ```
237 | 
238 | This is better than our current approach, but has many small drawbacks that add
239 | up:
240 | 
241 | * Argument passing becomes slightly more complex, because ARGs must be passed
242 | and read as either ENV variables or as build arguments.
243 | * Each dockerfile has to be properly documented manually, if at all.
244 | * Developers have to leave the Dockerfile to read the shell scripts, which
245 | gets annoying.
246 | * Maintenance is spread across the dockerfiles and the scripts, and can grow
247 | into even more work (like some Dockerfiles having extra non-script directives,
248 | etc.).
249 | * Extra overhead in the scripts can be kind of wasteful.
250 | 
251 | # Work Estimates
252 | 
253 | I have already completed a PR that will introduce these Dockerfiles without
254 | affecting our current builds. These would probably take a week or two to
255 | migrate.
256 | 
257 | ## Questions and Discussion Topics
258 | 
259 | Seed this with open questions you require feedback on from the RFC process.
260 | 
--------------------------------------------------------------------------------
/rfcs/20180821-differentiable-functional-while.md:
--------------------------------------------------------------------------------
 1 | # Functional while_loop
 2 | | Status        | Accepted                                              |
 3 | :---------------|:-----------------------------------------------------|
 4 | | **Author**    | Saurabh Saxena (Google)                               |
 5 | | **Sponsor**   | Skye Wanderman-Milne (Google)                         |
 6 | | **Updated**   | 2018-08-23                                            |
 7 | 
 8 | 
 9 | ## Objective
10 | 
11 | This proposal describes an implementation of [while_loop](https://www.tensorflow.org/api_docs/python/tf/while_loop) which adds a single While op to the GraphDef, as opposed to the current implementation that uses [lower level primitives](https://arxiv.org/abs/1805.01772). The goal is to simplify debugging and other analysis and to make it easier for compiler backends like XLA to [recognize](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/compiler/tf2xla/functionalize_while.cc) the while loop in the GraphDef. At runtime, a C++ optimization pass will lower this op to the primitive dataflow ops for feature parity with the current implementation, similar to what we do for the [If op](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/common_runtime/lower_if_op.cc).
12 | 
13 | 
14 | ## Motivation
15 | 
16 | TensorFlow provides two flavours of control flow constructs, which differ widely in the way they manifest themselves in the GraphDef:
17 | 
18 | 
19 | 
20 | 1. Functional ops which create a single node in the Graph: [If](https://github.com/tensorflow/tensorflow/blob/fc4504edb1ab419ae59b0ebb9ff8d943beb61117/tensorflow/core/ops/functional_ops.cc#L104), [While](https://github.com/tensorflow/tensorflow/blob/fc4504edb1ab419ae59b0ebb9ff8d943beb61117/tensorflow/core/ops/functional_ops.cc#L147).
21 | 1. Non-functional ops which make use of primitive control flow constructs, namely Enter, Exit, Switch, Merge and NextIteration: [tf.cond](https://www.tensorflow.org/api_docs/python/tf/cond), [tf.while_loop](https://www.tensorflow.org/api_docs/python/tf/while_loop).
22 | 
23 | Both approaches have their merits and demerits. The functional representation emits a single node in the GraphDef, thus making it easy to recognize such ops in processing pipelines that operate on the GraphDef, which is not the case when control flow is represented using lower level primitives. The functional representation is however not easily differentiable: it requires using the [SymbolicGradient](https://github.com/tensorflow/tensorflow/blob/a0e76ce73c5f095fc61e06c19ff8e653cfd2965c/tensorflow/core/ops/functional_ops.cc#L24) op, which recomputes the forward pass (slow) and needs symbolic gradients defined for all ops in the function body, which can be complicated to implement. Also, since we force a strict execution of functions, i.e., a function can start executing only after its inputs are all ready, the functional ops may not be that performant. The current representation solved these problems at the cost of a slightly complicated GraphDef. In this proposal, we try to achieve the best of both worlds.
24 | 
25 | We recently added a differentiable version of the [functional If/cond op](https://github.com/tensorflow/community/blob/master/rfcs/20180507-cond-v2.md). As with functional cond, a key challenge here is to figure out gradient computation.
For cond, we could expose the [intermediate tensors](https://github.com/tensorflow/tensorflow/blob/51100a8de57ef53e36a8a9f5a9829cbd33fbed04/tensorflow/python/ops/cond_v2_impl.py#L114) as op outputs so that they could be used for computing gradients. We cannot directly do the same for while loops since we would need the intermediate values _for all iterations_ and not just the values after the last iteration. Hence, some sort of accumulator is required. We use TensorLists for accumulating the loop body intermediates. Since while loops may run for a large number of iterations, e.g. long RNNs, we need to be mindful of the memory used by accumulators.
26 | 
27 | 
28 | ## Design Proposal
29 | 
30 | 
31 | ### Accumulating intermediates
32 | 
33 | 
34 | #### Stack vs TensorArray vs TensorList
35 | 
36 | The current implementation uses [Stacks](https://github.com/tensorflow/tensorflow/blob/51100a8de57ef53e36a8a9f5a9829cbd33fbed04/tensorflow/python/ops/control_flow_ops.py#L1002) for accumulating intermediate values from the forward pass that may be needed for gradient computation. This implementation will use [TensorLists](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/ops/list_ops.cc) (TL) instead, which, unlike Stack and TensorArray, do not have mutable internal state, making them easy to differentiate.
37 | 
38 | 
39 | #### Algorithm
40 | 
41 | For each intermediate tensor of the while loop function body that may be needed for gradient computation, we create an empty TensorList and add it to the list of loop_vars. We then push the intermediate values to the TL using the [TensorListPushBack](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/ops/list_ops.cc#L40) op. Note that this way we may be accumulating more tensors than are actually needed for gradient computation. It is even possible that the graph is just used for inference and hence we do not need the accumulators at all! We rely on the [C++ optimization pass](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/grappler/optimizers/model_pruner.cc) that happens after the While op is lowered to remove all such superfluous accumulators. So adding extra accumulators will not have any performance or memory overhead at runtime.
42 | 
43 | To facilitate use-cases where lowering is not desired, we can perform a few optimizations to the functional form of the While op:
44 | 
45 | * Expose only those intermediate values that are required by the backward pass by building the gradient graph in the forward pass.
46 |   * This will increase graph building time.
47 | * Do not accumulate Const nodes. We can lift these outside the while loop.
48 | * Do not accumulate loop vars that are passed-through unchanged.
49 | * Rewrite the forward pass to add accumulators when gradients are requested.
50 |   * This will require creating a new While op and new FunctionDefs for the loop condition and body.
51 |   * Since we cannot remove nodes from the Graph, there will be unused functions and the dangling While op in the GraphDef. These will however be pruned out at runtime and hence will not affect performance or correctness.
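For illustration, here is a rough sketch of this accumulation pattern written out by hand against the raw TensorList ops. This is a sketch only: the `tf.raw_ops` names and signatures used here are best-effort assumptions, and in the proposal the accumulators are wired in automatically by the new while_loop rather than by the user:

```python
import tensorflow as tf

x = tf.constant(2.)

# One empty TensorList per intermediate value we need to remember.
acc = tf.raw_ops.EmptyTensorList(
    element_shape=[], max_num_elements=-1, element_dtype=tf.float32)

def cond(i, v, acc):
  return v < 8.

def body(i, v, acc):
  # Push the value of `v` at the start of this iteration.
  acc = tf.raw_ops.TensorListPushBack(input_handle=acc, tensor=v)
  return i + 1, v * v, acc

# `i` plays the role of the loop counter described in the next section;
# `acc` ends up holding the value of `v` from every iteration.
n, result, acc = tf.while_loop(cond, body, [0, x, acc])
```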
52 | 
53 | 
54 | ### Computing gradients
55 | 
56 | Excerpt from white paper on [Control Flow in TensorFlow](http://download.tensorflow.org/paper/white_paper_tf_control_flow_implementation_2017_11_1.pdf):
57 | 
58 | > Intuitively, the gradient of `while_loop(pred, body)` is just a while loop of the form:
59 | >
60 | >
61 | > ```
62 | > def pred(i, _): return i < N
63 | > while_loop(pred, g_body, [0] + g_vars)
64 | > ```
65 | >
66 | > Where `N` is the number of iterations that the forward while loop runs, `g_body` is the gradient of the forward loop body, and `g_vars` is the initial values for the loop variables. As we will see later, `g_vars` includes the initial gradients for the loop variables of the forward while loop.
67 | 
68 | We use the same logic here as well. To get a count of the number of forward iterations, we add an integer counter which is initialized to 0 and is incremented in the loop body. Note that we just need the total number of iterations for the gradient pass, so we do not need to accumulate the intermediate values of the counter. This counter is always the first output of the While op.
69 | 
70 | To compute *g_body* we use the [gradients_impl._GradientsHelper](https://github.com/tensorflow/tensorflow/blob/600caf99897e82cd0db8665acca5e7630ec1a292/tensorflow/python/ops/gradients_impl.py#L599) function, which supports computing the gradient of a given [src_graph](https://github.com/tensorflow/tensorflow/blob/600caf99897e82cd0db8665acca5e7630ec1a292/tensorflow/python/ops/gradients_impl.py#L607) in another graph, which in this case is a [_FuncGraph](https://github.com/tensorflow/tensorflow/blob/600caf99897e82cd0db8665acca5e7630ec1a292/tensorflow/python/framework/function.py#L621). This gradient graph captures references to the intermediate values of the forward graph (the src_graph). We replace these references with popped values from the accumulators of the intermediate tensors. Note that these accumulators were already added to the list of loop_vars of the While op and hence were in the list of outputs of the forward While op.
71 | 
72 | We will register a custom python [gradient function](https://github.com/tensorflow/tensorflow/blob/0440ccfc199cbffc10aae19fde07f0100c823ed9/tensorflow/python/framework/ops.py#L2352) to compute the gradient of a functional While op. This will allow taking the gradient of any functional While op (not only the ones generated by the new while_loop function) which satisfies the following conditions:
73 | 
74 | 
75 | 
76 | 1. The first loop output must be the number of loop iterations.
77 | 1. Each intermediate tensor of the While body which may be needed during gradient computation must be accumulated in a TensorList. We will check to make sure that the TensorList is indeed unique to the intermediate value.
78 | 1. The position of the accumulator in the list of inputs and outputs must be the same.
79 | 
80 | The While op generated by the gradient function satisfies the above constraints and hence can be differentiated again to generate the 2nd order derivative and so on.
81 | 
82 | In the case of nested while loops, we will accumulate the intermediate values of inner while loops in nested TensorLists.
83 | 
84 | 
85 | ### Memory management
86 | 
87 | tf.while_loop swaps the tensors from GPU to CPU when the [swap_memory](https://github.com/tensorflow/tensorflow/blob/600caf99897e82cd0db8665acca5e7630ec1a292/tensorflow/python/ops/control_flow_ops.py#L3046) flag is set.
Section 5.3 of the control flow [paper](https://arxiv.org/abs/1805.01772) mentions that with memory swapping they were able to handle an RNN with 2x the unrolled length (1000 vs 500) with little overhead. The heuristics for memory swapping are implemented in the [StackPush](https://github.com/tensorflow/tensorflow/blob/600caf99897e82cd0db8665acca5e7630ec1a292/tensorflow/core/kernels/stack_ops.cc#L289) and [StackPop](https://github.com/tensorflow/tensorflow/blob/600caf99897e82cd0db8665acca5e7630ec1a292/tensorflow/core/kernels/stack_ops.cc#L411) ops. We will need to support similar functionality for the TensorListPushBack and TensorListPopBack ops.
88 | 
89 | 
90 | ### Lowering pass
91 | 
92 | In order to get feature parity with the current implementation, we will lower the While op to the current while loop graph representation as a grappler pass, similar to the one for [if_op](https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/common_runtime/lower_if_op.cc). This gets us around some of the issues with the current functional op:
93 | 
94 | 
95 | 
96 | 1. We can perform parallel iterations, which the functional op cannot, due to the strict-mode execution of functions, which requires that all inputs to the function be ready before the function can start executing. We will need to add a `parallel_iterations` attr to the While op.
97 | 1. The FunctionLibraryRuntime currently does not allow running multi-device functions.
98 | 1. We can perform global grappler optimizations without needing to cross function boundaries. E.g. we can remove accumulators for intermediate values which are not consumed downstream.
99 | 
100 | 
101 | ### Example
102 | 
103 | ```python
104 | 
105 | x = tf.constant(2.)
106 | 
107 | ret = while_loop(lambda v: v < 8., lambda v: v * v, [x])
108 | 
109 | grad = tf.gradients(ret, [x])
110 | 
111 | ```
112 | 
113 | **Current implementation**
114 | 
115 | 
116 | 
117 | ![alt_text](20180821-differentiable-functional-while/while_v1.png "image_tooltip")
118 | 
119 | 
120 | **New implementation**
121 | 
122 | 
123 | 
124 | ![alt_text](20180821-differentiable-functional-while/while_v2.png "image_tooltip")
125 | 
126 | 
127 | The forward functional while op is highlighted in red. Note that it takes 2 `Const` nodes as inputs. One of the `Const` nodes is `x` with value 2. The other `Const` node is the initial value of the loop counter which is set to 0. There are also 2 `EmptyTensorList` nodes which are used for accumulating intermediate values.
128 | 
129 | *while_cond*
130 | 
131 | The loop condition function is fairly trivial. It expects the extra args for the loop counter and accumulators but doesn't actually use them.
132 | 
133 | 
134 | 
135 | ![alt_text](20180821-differentiable-functional-while/while_cond.png "image_tooltip")
136 | 
137 | 
138 | *while_body*
139 | 
140 | The loop body contains the extra nodes for updating the counter and accumulating intermediates.
141 | 
142 | 
143 | 
144 | ![alt_text](20180821-differentiable-functional-while/while_body.png "image_tooltip")
145 | 
146 | 
147 | `arg0` is the loop counter which gets initialized to 0. This is always the first argument.
148 | 
149 | `arg1` is the value of x at the start of each iteration.
150 | 
151 | `add_0` is the counter update node and `y` is the increment `Const` node with value 1.
152 | 
153 | `mul_0` performs `x * x`.
154 | 
155 | 
156 | Accumulators:
157 | 
158 | `tensorlist0` <- `arg1`, the value of `x` at the start of the loop.
159 | 
160 | `tensorlist1` <- Output of `mul_0`.
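To make the gradient construction concrete, here is a hand-written sketch of the backward loop for this example, reusing the hypothetical names (`n`, `acc`) from the sketch in the Accumulating intermediates section. It is illustrative only; the real graph is emitted by the registered gradient function:

```python
def grad_cond(j, g, acc):
  return j < n  # n = number of forward iterations (first While output)

def grad_body(j, g, acc):
  # Pop the value pushed during the matching forward iteration
  # (values come back in reverse order).
  acc, v = tf.raw_ops.TensorListPopBack(
      input_handle=acc, element_shape=[], element_dtype=tf.float32)
  return j + 1, g * 2. * v, acc  # d(v * v)/dv = 2 * v

_, grad_x, _ = tf.while_loop(grad_cond, grad_body, [0, tf.ones_like(x), acc])
# Here the forward loop ran twice (2 -> 4 -> 16), so grad_x = 4 * x**3 = 32.
```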
161 | 
162 | ## Discussion notes
163 | 
164 | Please see notes in [tensorflow/community#13](https://github.com/tensorflow/community/pull/13#issuecomment-422591773).
--------------------------------------------------------------------------------
/rfcs/20180726-tf-data-windowing-reducers.md:
--------------------------------------------------------------------------------
 1 | # Generalizing tf.data batching using windowing and reducers
 2 | 
 3 | | Status        | Accepted                                              |
 4 | :---------------|:-----------------------------------------------------|
 5 | | **Author(s)** | Jiri Simsa (Google)                                   |
 6 | | **Sponsor**   | Derek Murray (Google)                                 |
 7 | | **Updated**   | 2018-09-19                                            |
 8 | 
 9 | ## Objective
10 | 
11 | This proposal addresses the known limitations of the current tf.data batching API:
12 | 
13 | * it provides a mechanism for padded batching of sparse tensors
14 | * it facilitates customization of batching logic (users can now express batching logic as a pure Python function)
15 | * it enables application of different batching logic on different components
16 | 
17 | ## Motivation
18 | 
19 | The tf.data API is the de facto standard for creating TensorFlow input pipelines, whose purpose is to extract data from a storage system, transform it, and load it onto an accelerator.
20 | 
21 | A common transformation performed by TensorFlow input pipelines is batching -- combining multiple tensors into a single tensor of higher dimension, most often to make a minibatch for training. Currently, the core tf.data API for batching consists of [batch](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#batch) and [padded_batch](https://www.tensorflow.org/api_docs/python/tf/data/Dataset#padded_batch). The former assumes the inputs have the same shape and supports both dense and sparse inputs. The latter supports dynamically shaped inputs, such as you might find in sequential data: it assumes the inputs have the same rank but not necessarily the same shape and can pad differently shaped inputs to a common shape; only dense inputs are supported by padded_batch.
22 | 
23 | The tf.data batching API has several limitations that have surfaced in various user requests:
24 | 
25 | * As already mentioned, the padded_batch transformation does not support sparse tensor inputs ([issue](https://github.com/tensorflow/tensorflow/issues/18302)).
26 | * The current API is not flexible enough to accept user-provided batching logic (e.g. [issue](https://github.com/tensorflow/tensorflow/issues/20391)).
27 | * The same batching logic needs to be applied to all components of the input dataset, which is not always desirable (e.g. [issue](https://github.com/tensorflow/tensorflow/issues/20391)). Users can work around this limitation by creating separate datasets to which different batching transformations are applied and then zipping the datasets; however, this can be inefficient, unergonomic, and error prone.
28 | 
29 | 
30 | ## Proposal
31 | 
32 | This document proposes leveraging the recently introduced support for _nested_ datasets as inputs to tf.data transformations to perform generalized batching as follows:
33 | 
34 | 
35 | 
36 | 1. A __window__ transformation is used to combine consecutive elements of the input into a nested dataset (as opposed to a higher dimensional tensor).
37 | 1. A map transformation is used to, on a per-component basis, apply a suitable __reducer__ which transforms the nested dataset to a batched tensor.
38 | 
39 | The underlined transformations do not exist and are the proposed extensions to the tf.data API.
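As a rough sketch of how the two steps compose in the single-component case (`batch_size` and `my_batching_reducer` are placeholders for the pieces introduced below):

```python
batches = dataset.window(batch_size, shift=batch_size).map(
    lambda window: window.reduce(my_batching_reducer))
```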
40 | 
41 | 
42 | ### Windowing
43 | 
44 | Windowing combines elements of a dataset into finite datasets referred to as windows. This is similar to batching, with the main difference being that batching combines elements of a dataset into a higher dimensional element, while windowing combines the elements into a dataset.
45 | 
46 | 
47 | ```python
48 | def window(size, shift=1, stride=1, drop_remainder=True):
49 |   """Combines input elements into a dataset of windows.
50 | 
51 |   Each window is a dataset itself and contains `size` elements (or
52 |   possibly fewer if there are not enough input elements to fill the window
53 |   and `drop_remainder` evaluates to false).
54 | 
55 |   The `stride` argument determines the stride of the input elements,
56 |   and the `shift` argument determines the shift of the window.
57 | 
58 |   For example:
59 |   - tf.data.Dataset.range(5).window(3) produces {{0, 1, 2}, {1, 2, 3}, {2, 3, 4}}
60 |   - tf.data.Dataset.range(5).window(3, 3, 1, False) produces {{0, 1, 2}, {3, 4}}
61 |   - tf.data.Dataset.range(6).window(3, 1, 2) produces {{0, 2, 4}, {1, 3, 5}}
62 | 
63 |   Args:
64 |     size: A `tf.int64` scalar `tf.Tensor`, representing the number
65 |       of elements of the input dataset to combine into a window.
66 |     shift: A `tf.int64` scalar `tf.Tensor`, representing the forward
67 |       shift of the sliding window in each iteration.
68 |     stride: A `tf.int64` scalar `tf.Tensor`, representing the stride
69 |       of the input elements in the sliding window.
70 |     drop_remainder: A `tf.bool` scalar `tf.Tensor`, representing whether
71 |       a window should be dropped in case its size is smaller than
72 |       `size`.
73 | 
74 |   Returns:
75 |     Dataset: A `Dataset` whose elements are a `Dataset`.
76 |   """
77 | ```
78 | 
79 | 
80 | ### Reducers
81 | 
82 | 
83 | #### Example 0: Count Elements
84 | 
85 | To introduce the concept of tf.data reducers to readers unfamiliar with it, we illustrate how a reducer can be used to count the elements of a dataset:
86 | 
87 | 
88 | ```python
89 | def count(dataset):
90 |   """Counts the elements of a dataset."""
91 | 
92 |   def init_fn(_):
93 |     return 0
94 | 
95 |   def reduce_fn(state, value):
96 |     return state + 1
97 | 
98 |   def finalize_fn(state):
99 |     return state
100 | 
101 |   count_reducer = tf.data.Reducer(init_fn, reduce_fn, finalize_fn)
102 |   return dataset.reduce(count_reducer)
103 | 
104 | value = count(tf.data.Dataset.range(10))
105 | with tf.Session() as sess:
106 |   print(sess.run(value))  # produces 10
107 | ```
108 | 
109 | 
110 | As you can see, a tf.data reducer consists of three functions: 1) an _init()_ function that sets up the initial state, which can be an arbitrary nest of tensor-like objects, 2) a _reduce()_ function that defines how to update the intermediate state given the value of the next element, and 3) a _finalize()_ function that defines how to transform the final state into the output value.
111 | 
112 | The reducer inputs an entire dataset and reduces it to a single value. This single value is the result of taking the output of init(), calling reduce() successively on every element of the dataset until the dataset is exhausted, and then calling finalize() on the result.
113 | 
114 | 
115 | #### Example 1: Batch of Dense Tensors
116 | 
117 | Next, we illustrate how tf.data reducers can be used to create a batch from a dataset of dense tensors.
118 | 
119 | ```python
import tensorflow as tf
from tensorflow.python.ops import gen_array_ops  # internal module; provides the uninitialized `empty` op
120 | def batch_dense(dataset):
121 |   """Batches a dataset of dense tensors."""
122 | 
123 |   if dataset.output_shapes.is_fully_defined():
124 |     shape = dataset.output_shapes
125 |   else:
126 |     first_element = tf.contrib.data.get_single_element(dataset.take(1))
127 |     shape = tf.shape(first_element)
128 | 
129 |   def batch_init_fn(_):
130 |     """Return an empty Tensor of the correct shape and type."""
131 |     batch_shape = tf.concat([[0], shape], 0)
132 |     return gen_array_ops.empty(batch_shape, dtype=dataset.output_types)
133 | 
134 |   def batch_reduce_fn(state, value):
135 |     """Append this value to what we have of the batch so far."""
136 |     return tf.concat([state, [value]], 0)
137 | 
138 |   def batch_finalize_fn(state):
139 |     """Return the batch tensor as constructed so far."""
140 |     return state
141 | 
142 |   batch_reducer = tf.data.Reducer(batch_init_fn, batch_reduce_fn,
143 |                                   batch_finalize_fn)
144 |   return dataset.reduce(batch_reducer)
145 | 
146 | batch = batch_dense(tf.data.Dataset.range(5))
147 | with tf.Session() as sess:
148 |   print(sess.run(batch))  # produces [0 1 2 3 4]
149 | 
150 | ```
151 | 
152 | 
153 | 
154 | #### Example 2: Padded Batch of Dense Tensors
155 | 
156 | Our next tf.data reducer example illustrates how to use a reducer to create a padded batch from a dataset of dense tensors.
157 | 
158 | ```python
from tensorflow.python.data.util import convert  # internal shape-conversion helper
159 | def padded_batch_dense(dataset, padded_shape, padding_value):
160 |   """Batches a dataset of dense tensors with padding."""
161 | 
162 |   padded_shape = tf.cast(
163 |       convert.partial_shape_to_tensor(padded_shape), tf.int32)
164 | 
165 |   def init_fn(_):
166 |     return 0, padded_shape
167 | 
168 |   def reduce_fn(state, value):
169 |     count, shape = state
170 |     return count + 1, tf.maximum(shape, tf.shape(value))
171 | 
172 |   def finalize_fn(state):
173 |     return state
174 | 
175 |   # Compute the padded shape and count elements.
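  # (Note: this takes two passes over `dataset`: the reduce below computes
  # the element count and the component-wise maximum shape, and the
  # map/batch afterwards pads each element to that shape, so the input
  # dataset must be re-iterable.)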
176 |   reducer = tf.data.Reducer(init_fn, reduce_fn, finalize_fn)
177 |   count, padded_shape = dataset.reduce(reducer)
178 | 
179 |   def pad_fn(value):
180 |     shape = tf.shape(value)
181 |     left = tf.zeros_like(shape)
182 |     right = padded_shape - shape
183 |     return tf.pad(value, tf.stack([left, right], 1),
184 |                   constant_values=padding_value)
185 | 
186 |   return dataset.map(pad_fn).batch(count)
187 | 
188 | padded_batch = padded_batch_dense(
189 |     tf.data.Dataset.from_tensor_slices([[1], [2]]), [2],
190 |     0).make_one_shot_iterator().get_next()
191 | with tf.Session() as sess:
192 |   print(sess.run(padded_batch))  # produces [[1 0] [2 0]]
193 | ```
194 | 
195 | 
196 | 
197 | ### End-to-end Example
198 | 
199 | Finally, we illustrate how to use the window transformation to perform generalized tf.data batching:
200 | 
201 | ```python
202 | import tensorflow as tf
203 | 
204 | def gen():
205 |   yield ('a', [1])
206 |   yield ('b', [2])
207 |   yield ('c', [3])
208 |   yield ('d', [4, 4])
209 | 
210 | def map_fn(a, b):
211 |   return tf.data.Dataset.zip((a.batch(2), b.padded_batch(2, [2])))
212 | 
213 | dataset = tf.data.Dataset.from_generator(gen, (tf.string, tf.int32))
214 | dataset = dataset.window(2, 2).flat_map(map_fn)
215 | get_next = dataset.make_one_shot_iterator().get_next()
216 | 
217 | with tf.Session() as sess:
218 |   print(sess.run(get_next))  # produces (['a', 'b'], [[1, 0], [2, 0]])
219 |   print(sess.run(get_next))  # produces (['c', 'd'], [[3, 0], [4, 4]])
220 | ```
221 | 
222 | 
223 | 
224 | ## API Changes
225 | 
226 | This design document proposes the following changes to the tf.data API:
227 | 
228 | * Adding a `tf.data.Dataset.window` method, which provides the windowing functionality described in this proposal.
229 | * Promoting the `tf.contrib.data.reduce_dataset()` method to `tf.data.Dataset.reduce()` and the `tf.contrib.data.Reducer` class to `tf.data.Reducer`.
230 | * Allowing nested datasets as inputs of `map` and `filter`.
231 | * Adding canned reducers for padded batching of dense and sparse tensors to `tf.contrib.data`, changing the implementation of `tf.data.Dataset.padded_batch()` to use these, and marking it as deprecated.
232 | 
233 | ## Summary
234 | 
235 | This proposal addresses known limitations of the current tf.data batching API:
236 | 
237 | * it provides a mechanism for padded batching of sparse tensors
238 | * it facilitates customization of batching logic (users can now express batching logic as a pure Python function)
239 | * it enables application of different batching logic on different components
240 | 
241 | 
242 | ## Discussion Notes
243 | 
244 | See also notes from [public review](https://github.com/tensorflow/community/pull/5). The following notes were taken in the review committee.
245 | 
246 | Q: What is the value added by the new examples?
247 | 
248 | A: The previous examples were inefficient versions of things that already exist.
249 | 
250 | Q: The obvious use of the API led to an inefficient implementation (of batching, using tf.concat()). It might be hard to write batching in this API without it being inefficient.
251 | 
252 | A: This API is not meant to be used to implement something that already exists.
253 | 
254 | Q: Is this not a good API for implementing batching? The structure encourages inefficient implementations.
255 | 
256 | A: The point was not to illustrate how we do batching efficiently. It's already done.
257 | 
258 | Q: I thought the point was to show many different ways to do batching.
259 | 260 | A: The base case is still an efficient implementation of batch, but we can add other logic around it (e.g. to do different forms of padding, etc.). 261 | 262 | Q: What were the biggest questions? 263 | 264 | A: Batching efficiency was the biggest one. Some questions about the signature of the newly introduced transformation. One reader commented that the meaning of "window" in other communities (video processing) typically includes some notion of slide/stride. Conclusion was that we will support shift and stride as we already do in `sliding_window_batch()`. Stride = number of elements you skip (i.e. for non-consecutive elements in a window), shift = how much the window shifts between windows. 265 | 266 | Q: Is there any significant overhead from elements being datasets (e.g. from extra work in Python)? 267 | 268 | A: The amount of computation that you have to do to compute the batch should be the same. There is no additional work in Python. 269 | 270 | Q: How do you compile the reduce function to run it in C++? 271 | 272 | A: It's a TF function, similar to existing map functions, etc. 273 | 274 | Q: Concern about how many times count() is invoked. 275 | 276 | A: The example shows how to use it in a filter(), where the count is evaluated in a function context. 277 | 278 | Q: Re: runtime efficiency, in the higher dimensional case, would we always make a copy to concatenate? 279 | 280 | A: That's what the Dataset.batch() transformation does. The nested dataset elements aren't intended for direct consumption, but to serve as input to other transformations, which e.g. build padded batches, sparse tensors, etc. This proposal lets you mix and match how you treat the different components, as illustrated in the end-to-end example. The goal of the new API isn't to improve efficiency of the existing implementations, but to add support for new kinds of transformation. 281 | 282 | Q: What about the parallel proposal for random access datasets? Will count() be an exposed primitive or would you use the efficient random-access count? 283 | 284 | A: We would add efficient random-access count for the nested datasets produced by window(). 285 | 286 | -------------------------------------------------------------------------------- /rfcs/20180817-variables-20.md: -------------------------------------------------------------------------------- 1 | # Variables in TensorFlow 2.0 2 | 3 | | Status | Accepted | 4 | :-------------- |:---------------------------------------------------- | 5 | | **Author(s)** | apassos@google.com | 6 | | **Sponsor** | wicke@google.com, joshl@google.com, ashankar@google.com | 7 | | **Updated** | 2018-08-17 | 8 | 9 | 10 | ## Objective 11 | 12 | The API for TensorFlow variables has many drawbacks: impossible-to-reason-about semantics, reliance on global scopes, and reliance on global collections. As the TensorFlow API moves to become more pythonic and object oriented, with the Keras layers and models and the object-based serialization, we no longer have a need for much of this global infrastructure around variables. 
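To make the shift concrete, here is a minimal sketch (illustrative only: the shapes and inputs are made up, the 1.x form assumes graph mode, and the 1.x APIs move to `tf.compat.v1`):

```python
import tensorflow as tf

# TF 1.x style: sharing is mediated by global scopes and get_variable.
with tf.compat.v1.variable_scope("dense", reuse=tf.compat.v1.AUTO_REUSE):
  w = tf.compat.v1.get_variable("w", shape=[10, 4])

# TF 2.0 style: sharing means reusing the same Python object.
x1 = tf.random.normal([2, 10])
x2 = tf.random.normal([2, 10])
w = tf.Variable(tf.random.normal([10, 4]), name="w")
y1 = tf.matmul(x1, w)
y2 = tf.matmul(x2, w)  # "reuse" is just using `w` again
```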
13 | 
14 | 
15 | ## Main changes
16 | 
17 | The API for Variables will then change in the following ways for TF 2.0:
18 | 
19 | 
20 | 
21 | * tf.Variable will become an abstract base class with a well-defined interface and a scoped factory to construct instances
22 | * users will be able to implement their own variable-like objects by subclassing tf.Variable and adding a scoped factory function to use those variables
23 | * variable_scope and get_variable will be removed
24 | * the tf 1.0 version of variable_scope and get_variable will be left in tf.compat.v1
25 | * to control variable naming users can use tf.name_scope + tf.Variable
26 | * whether a variable is shared across sessions / processes will be controlled by a constructor argument to tf.Variable; no other type of scope reuse will be done in the framework
27 | * scoped partitioning will be implemented as a factory function at first
28 | * libraries and users are encouraged to reuse variables by reusing their objects, like Keras layers do
29 | * custom_getters will have the following API: [variable_creator_scope](https://github.com/tensorflow/tensorflow/blob/567189980f7a1c2aa09a5170bd8d01a6ec37d303/tensorflow/python/ops/variable_scope.py#L2402)
30 | * the default implementation of the tf.Variable interface will be ResourceVariable
31 | * RefVariable will be kept in tf.compat.v1 and will be the default implementation for tf.compat.v1.Variable
32 | * tf.compat.v1.Variable will have a use_resource argument to control whether a resource variable or a ref variable will be created
33 | * symbols like tf.assign* will be removed in favor of methods in tf.Variable
34 | * in tf.compat.v1 these symbols will be marked as deprecated and will call the corresponding methods in the Variable object instead
35 | 
36 | 
37 | ## Detailed changes
38 | 
39 | 
40 | ### tf.Variable class
41 | 
42 | The tf.Variable class will be an abstract base class which defines a tf.Variable interface. Initially this interface will have enough abstract methods such that the user-visible API of tf.Variable does not change.
43 | 
44 | There will be two main implementations of this interface: RefVariable, with the legacy ref edges, available only in tf.compat.v1, and ResourceVariable, which is the default for the v2 API. PartitionedVariable, MirroredVariable, _UnreadVariable, CastVariable, etc., are other implementations which are part of the core library. None of these implementations will be publicly visible, only tf.Variable will be.
45 | 
46 | Constructing variables is done by calling tf.Variable(*args, **kwargs). Under the hood this will call a hierarchy of scoped constructor functions, similar to what is now done in variable_scope.variable. Each such constructor function can do some combination of:
47 | 
48 | 
49 | 
50 | * calling a base constructor to actually create a variable
51 | * returning preexisting variables
52 | * changing some arguments to the base constructor, and maybe calling it multiple times
53 | 
54 | This is implemented by having a custom metaclass for tf.Variable which, when asked to construct a tf.Variable directly, will call the factory functions, but, when asked to construct a subclass of tf.Variable, will bypass the factories and construct the child class directly.
55 | 
56 | The tf.Variable interface will make no reference to graph collections, and tf.Variable will not add the variable to any collections by default. tf.compat.v1.Variable, on the other hand, will have the collections argument and respect the existing semantics for it.
Things which currently rely on collections (saving / loading, Optimizer.minimize, etc.) will instead be expected to be passed either a list of variables or a CheckpointableBase-inheriting object.
57 | 
58 | 
59 | ### Variable sharing
60 | 
61 | Sharing within a model will not be a part of the public API for tf.Variable. Users are strongly encouraged to share variables by sharing a reference to their objects.
62 | 
63 | That said, the tf.compat.v1.variable_scope library can be made self-contained if we replace the per-graph variable scope stack with a module-global weak key dictionary from graphs to scope objects, and we call the protected methods to access graph collections. This will remain available for users who are not willing to port their libraries to have object-based sharing, as the support burden of maintaining that file in tf.compat.v1 is negligible and the volume of code written against it is large.
64 | 
65 | 
66 | ### Checkpointing
67 | 
68 | Checkpointing will be done in tf 2.0 via the [object-oriented checkpointing API](https://www.tensorflow.org/api_docs/python/tf/contrib/checkpoint/Checkpointable).
69 | 
70 | 
71 | ### Optimizers
72 | 
73 | The Optimizer.minimize method will no longer work if it's passed a Tensor and no list of variables. Users are expected to pass the list of variables to minimize with respect to, or to pass an object which implements the CheckpointableBase interface to let the optimizer find the variables. The behavior of tf.compat.v1.Optimizer will not change.
74 | 
75 | 
76 | ### Assignment operations
77 | 
78 | Instead of having free functions which access internal state of variables, reading from and writing to variables will be done via methods. Current tf.assign*(variable, ...) will become variable.assign*(...). tf.compat.v1 will keep the old aliases, but they will call the new methods instead.
79 | 
80 | This is an easy LSC to make (once the current operations are modified to return a RefVariable object instead of a Ref tensor) and will make the code more homogeneous and pythonic.
81 | 
82 | 
83 | ### Ref edges versus resources
84 | 
85 | TensorFlow graphs need to represent state (information which survives calls to session.run, or generally information produced by an op which depends on something other than the content of its input tensors) so that most nontrivial programs can be useful. Examples of state are input pipelines, model parameters, queues, mutexes, and random number generators.
86 | 
87 | There are a number of ways of representing state in TensorFlow directly in the graph, but the most robust and flexible is using resource handles. A **resource handle** is a regular immutable Tensor which represents a name for a shared out-of-graph resource (any C++ class inheriting from ResourceBase can be used as a resource). The resource handle itself doesn't change during the program execution. The resource pointed to by a handle lives on a specific device (so while it's possible to serialize resource handle tensors, it's usually not a good idea), and can be accessed by any op which runs on that device and has access to the resource handle tensor. These ops can do things such as reading from the resource, modifying the resource, initializing the resource, and deleting it.
88 | 
89 | A resource handle is a scalar tensor of dtype DT_RESOURCE (or dtypes.resource in Python), and can be manipulated as any other Tensor: you can concatenate resources, they can go through conditionals, you can slice into them, etc.
This means that while it's often possible to determine statically whether two operations can access the same resource, some graphs might be structured in ways which make this difficult.
90 | 
91 | When you can determine statically that two ops touch the same resource, you can make inferences about the state of the resource when one op is executing solely by looking at the graph. For example, if there is a path formed of control or data edges connecting a resource-using op O to a resource-using op O', you know that O' is guaranteed to see the effects of O on the resource and, conversely, that O is guaranteed to not see the effects of O' on the resource. If, on the other hand, there is no path in the graph connecting ops O and O' which use the same resource, then whether one sees the effects of the other is undefined, and might vary from one execution to another.
92 | 
93 | Resource variables were the motivating case for introducing the explicit notion of resources to TensorFlow graphs. This was done to avoid complicated issues related to the lack of a memory model for the deprecated ref-edge-based variables and to allow compilation of TensorFlow graphs containing mutable state.
94 | 
95 | A resource-based variable is the simplest type of resource. What's stored in the device's resource manager is a pair of a Tensor and a mutex. The main operation to read the value of a variable is read_variable_op, and it simply outputs a Tensor which has the same value as the Tensor in the resource handle state. There are many ops which write to the resource (assign_variable_op, assign_add_variable_op, resource_apply_gradient_descent, etc.), and the basic properties of the resource edges ensure that it's possible to order reading and writing ops to avoid undefined behavior.
96 | 
97 | These ops are currently implemented using copy-on-write, but they could also be implemented using copy-on-read or other, more complex, mechanisms, as long as the read-before-write and write-before-read semantics are respected and as long as no mutation is done to the Tensor returned by a read_variable_op after it's been read. Here are two examples of why mutating a Tensor returned by a read_variable_op might be dangerous:
98 | 
99 | 
100 | 
101 | * tf.cond predicates: a tf.cond takes a boolean tensor as a predicate and conditionally executes ops in the true or false branch of the conditional based on the value of the predicate. The way this is implemented in TensorFlow, to allow for graph pruning and non-strict execution, is that there are many "switch" ops in the graph, each of which looks at the value of the predicate and decides which operations downstream from it can execute. If the predicate is a variable and one branch modifies the value of this variable, we would like to ensure that, because the "read" operation happened before the switch ops, only one branch of the conditional will execute. If, instead, writing to a variable could mutate the value of the tensor returned by "read", then a subset of both branches could execute, leading to hard-to-debug errors.
102 | * gating gradients: when computing the backward pass while training a deep neural network, there is by default no in-graph order between the operation that updates the parameters of a layer based on its gradients and the operations that use the value of those parameters to compute the gradient with respect to the previous layer.
If the value of a variable was allowed to change after it was read, it would be possible for the value after the update to be used in the backward pass, leading to incorrect gradients for the layers closer to the input of the network.
103 | 
104 | These are just two examples of how it's much harder to reason about TensorFlow programs when the value of a variable can change after it was read.
105 | 
106 | Before resource handles, TensorFlow variables were represented using a "ref" edge. A ref edge is a pair of pointers, one to a Tensor and one to a mutex, owned by something other than the tf runtime. When an op expects a ref tensor, its input has to be a ref tensor; but when an op expects a non-ref tensor and its input is a ref tensor, the pointer is silently dereferenced. This means that normal tensor objects in the graph can silently alias a mutable tensor, and hence two ops with the same input can see it having different values. Which value will be seen can depend on execution-specific details such as whether the variables are on a local or remote device, and in general it's not easy to ensure that a read happens before or after a specific write.
107 | 
108 | 
109 | ### Internal resource variable ops
110 | 
111 | We will expose the internal ops used to implement ResourceVariable as tf.experimental.variable_operations (name TBD). This way users and libraries can, if they need to, modify the behavior of variables at will.
112 | 
113 | 
114 | ## Migration plan
115 | 
116 | The migration plan is roughly as follows. TODO(apassos): flesh out this section with cost estimates.
117 | 
118 | 
119 | 
120 | 1. Implement the abstract base class and factory function scope under the hood
121 | 1. Expose the factory function scope as tf.variable_creator_scope
122 | 1. LSC to change tf.variable_scope / tf.get_variable to tf.compat.v1.*
123 | 1. Removal of tf.variable_scope and tf.get_variable from the tf 2 namespace
124 | 1. Implement the subclass to be returned from tf.assign*
125 | 1. LSC to change tf.assign*(v, …) to v.assign*(...)
126 | 1. Change the implementation of tf.compat.v1.variable_scope to not rely on a per-graph variable scope stack
127 | 1. Remove the get_variable_scope and related public methods from tf.Graph (leaving them on tf.compat.v1.Graph)
128 | 1. Implement PartitionedVariable as a subclass of the tf.Variable interface
129 | 1. Add a partitioner scope to the tf 2.0 API
130 | 1. Add a deprecation warning to the tf.compat.v1 partitioned variable scope with a migration warning
131 | 1. [questionable] Implement a variable creator factory function which calls get_variable under the hood
132 | 1. Make this function active in all tf.compat.v1 endpoints which currently call get_variable (with a decorator, probably)
133 | 1. Change the behavior in tf2 to call tf.Variable (which will redirect to tf.get_variable in tf.compat.v1, keeping the existing behavior but cleaning the codebase)
134 | 1. [WARNING: checkpoint-breaking change] drop calls to variable_scope in parts of our API which use it. Right now they are: feature_column, rnn, canned estimators, optimizer slots, TPU estimator. Most can be replaced with judicious use of name= arguments
135 | 1. [optional] Implement tf v2 make_template which does not rely on variable_scope internally and uses a factory creator function to track and reuse variables
136 | 
137 | 
138 | ## Questions and Discussion Topics
139 | 
140 | 1. How should we deal with the deprecation of model building APIs?
141 | -------------------------------------------------------------------------------- /rfcs/20181016-optimizer-unification.md: -------------------------------------------------------------------------------- 1 | # TensorFlow 2.0: Optimizer unification 2 | 3 | | Status | Proposed | 4 | :-------------- |:---------------------------------------------------- | 5 | | **Author(s)** | Francois Chollet (fchollet@google.com) | 6 | | **Sponsor** | Martin Wicke (wicke@google.com) | 7 | | **Updated** | 2018-10-16 | 8 | 9 | --- 10 | 11 | ## Context 12 | 13 | Keras has its own set of optimizers, living in `tf.keras.optimizers` (e.g. `tf.keras.optimizers.Adam`, `tf.keras.optimizers.Adadelta`). TensorFlow also has its own set of optimizers, living in `tf.train` (internally named `tf.training`), e.g. `tf.train.AdamOptimizer`, `tf.train.AdadeltaOptimizer`. 14 | 15 | TensorFlow optimizers are now the recommended way to train tf.keras models, because: 16 | 1. they are required to support eager execution. 17 | 2. they are required to support Distribution Strategies. 18 | 19 | However, there are a number of key Keras features that are broken when using TensorFlow optimizers, due to limitations of the current TensorFlow optimizer API: 20 | 21 | 1) `model.save()`, for a model compiled with a TF optimizer, will include neither the optimizer configuration nor the optimizer state, which prevents users from restarting training from a saved model. This is due to: 22 | - TF Optimizer instances cannot be serialized (cloned). 23 | - TF Optimizer instances do not implement the Layer/Model API for weight loading/setting. 24 | 25 | 2) The callbacks `LearningRateScheduler` and `ReduceLROnPlateau` (dynamic adaptation of the optimizer's learning rate during training) will not work for a model compiled with a TF optimizer. This is because there is no way to dynamically adjust the hyperparameters of a TF Optimizer after instantiating it. 26 | 27 | 3) By forcing TF Optimizers for Keras training, we are asking users to take on additional complexity to use Keras. It's not enough to learn ML and NN and Keras and datasets and eager mode; now users also need to know the library of TF optimizers and how to configure them. This also breaks the marketing pitch of "You can run tf.keras just like the normal keras library, with only an import change". 28 | 29 | In addition, it is fairly confusing for users to have 2 distinct sets of optimizers with different feature sets. 30 | 31 | Thus we should seek to unify the Keras optimizer API and the TensorFlow optimizer API, by 1) extending the TensorFlow optimizer API, and 2) replacing the tf.keras optimizers with the upgraded TF optimizers. 32 | 33 | --- 34 | 35 | ## Objective 36 | 37 | - Unify the `tf.train` and `tf.keras.optimizers` APIs: 38 | - Make all TensorFlow optimizers JSON-serializable, and make it possible to save/restore their state. 39 | - Make it possible to dynamically modify the values of the hyperparameters of all TensorFlow optimizers, in particular the learning rate. 40 | - The current ways to achieve dynamic learning rates are 1) using an LR tensor with built-in decay, or 2) using a callable. Both of these approaches are limited (they do not support fully dynamic rates, e.g. adapting the rate based on the current loss decrease) and neither is intuitive. Doing `optimizer.lr = 0.2` at arbitrary points during training is eager-first and more user-friendly.
41 | - Have a single set of optimizers (same signatures, same objects, no wrappers), introduced as a new set of classes with an updated API, importable from `tf.keras.optimizers`. These optimizers would be based on the existing `tf.contrib.optimizer_v2` optimizers (which themselves are based on the `tf.train` optimizers). 42 | 43 | 44 | The old optimizers will exist in tf.compat.v1 as-is. 45 | 46 | The known breaking changes are: 47 | - Due to name changes, old checkpoints would not be loadable with the new optimizers. This is opt-in: your checkpoint won't break until you start using the new optimizers in your code (you can always import the old optimizers from tf.compat.v1). 48 | - Some arguments are getting renamed. 49 | - The `use_locking` argument is removed. 50 | 51 | --- 52 | 53 | ## Design Proposal 54 | 55 | - Add a `get_config` method on every optimizer, as well as a `from_config` class method, to serialize / deserialize an optimizer (this does not include the weight values, i.e. the state, but only the hyperparameter values, i.e. the arguments that can be passed to the constructor). 56 | - Add `get_weights` and `set_weights` methods, to retrieve (or set) the optimizer’s state as a list of numpy arrays -- this is necessary for compatibility with the Keras API. 57 | - Add the ability to set the values of optimizer hyperparameters (i.e. the arguments that can be passed to the constructor) at any point in the lifetime of the optimizer, without having to reinstantiate it. In particular this includes the ability to change the value of the learning rate. 58 | - Add support for gradient clipping by norm and by value. 59 | - Disable reusing a single optimizer instance across multiple graphs. 60 | - Move the optimizer classes to `tf.keras.optimizers`, with revised signatures (see details below). 61 | 62 | 63 | --- 64 | 65 | ## Detailed Design 66 | 67 | ### I - Add a get_config method on every optimizer: 68 | 69 | ```python 70 | optimizer.get_config() 71 | ``` 72 | 73 | This method is already present on the Model class and every Layer class. 74 | 75 | **Returns:** 76 | - A JSON-serializable dictionary (it does not contain any non-serializable data such as tensors) containing the configuration of the optimizer, i.e. its constructor arguments. For instance, for Adadelta, this would look like `{'learning_rate': 0.1, 'rho': 0.95, 'epsilon': 1e-8, 'name': 'my_optimizer'}` 77 | 78 | 79 | ### II - Add a from_config class method on every optimizer (only needs a single implementation on the base class): 80 | 81 | ```python 82 | optimizer = Adadelta.from_config(config) 83 | ``` 84 | 85 | This method is already present on the Model class and every Layer class. This method is required for Keras compatibility. 86 | 87 | **Args:** 88 | - config: A dictionary, containing the same keys as what gets returned by `get_config`. 89 | 90 | **Returns:** 91 | - An optimizer instance with the desired configuration, effectively a clone of the original optimizer (minus its state, i.e. its weight values). 92 | 93 | 94 | ### III - Add a get_weights method on every optimizer (only needs a single implementation on the base class): 95 | 96 | ```python 97 | optimizer.get_weights() 98 | ``` 99 | 100 | This method is already present on the Model class and every Layer class. 101 | 102 | **Returns:** 103 | - A flat list of Numpy arrays, in deterministic order, where each array represents the value of an internal weight of the optimizer (such as the momentum of a model weight).
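As a quick illustration of how sections I and II compose, here is a hypothetical round trip; it is a sketch that assumes the proposed `Adadelta` class exists as described in this document:

```python
import json

optimizer = Adadelta(learning_rate=0.2)
config = optimizer.get_config()      # e.g. {'learning_rate': 0.2, 'rho': 0.95, ...}
payload = json.dumps(config)         # possible because the config is JSON-serializable
clone = Adadelta.from_config(json.loads(payload))  # same hyperparameters, fresh state
```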
104 | 105 | 106 | ### IV - Add a set_weights method on every optimizer (only needs a single implementation on the base class): 107 | 108 | ```python 109 | optimizer.set_weights(weights) 110 | ``` 111 | 112 | This method is already present on the Model class and every Layer class. This method is required for Keras compatibility. 113 | 114 | **Args:** 115 | - weights: A flat list of Numpy arrays, in deterministic order, the same as returned by get_weights. Note that since the optimizer creates its internal weights to match the set of weights it is trying to optimize, set_weights would only match get_weights when the set of weights being optimized is equivalent. E.g.: 116 | 117 | ```python 118 | optimizer = Adadelta() 119 | _ = optimizer.get_weights() # returns an empty list since the optimizer has no weights at that point 120 | model.compile(optimizer=optimizer, loss=loss) # Weights are created here 121 | weights = optimizer.get_weights() # Returns a list of numpy arrays 122 | optimizer.set_weights(weights) # This works! 123 | 124 | # This will not work since this optimizer would have a different set of weights 125 | different_model.optimizer.set_weights(weights) 126 | ``` 127 | 128 | Note: if the optimizer has been called on more than a single set of weights, we should disable `get_weights` and `set_weights` since their meaning would be ambiguous. 129 | 130 | 131 | ### V - Make all optimizer hyperparameters accessible via attributes (they currently aren’t retrievable): 132 | 133 | ```python 134 | optimizer = Adadelta(learning_rate=0.2) 135 | optimizer.learning_rate # returns the learning rate tensor 136 | ``` 137 | 138 | This should generally work for any numerical parameter that can be passed to the constructor. 139 | 140 | 141 | ### VI - Make the following work on every optimizer, in both eager and graph modes: 142 | 143 | ```python 144 | optimizer = Adadelta(learning_rate=0.2) 145 | optimizer.learning_rate = 0.1 146 | ``` 147 | 148 | This should generally work for any numerical parameter that can be passed to the constructor. 149 | 150 | In graph mode, this would require 1) creating TF variables for these parameters in the constructor, and 2) overriding `__setattr__` to do an assign on the target parameter using the default session. 151 | 152 | In eager mode, there are no issues. 153 | 154 | 155 | ### VII - Add support for gradient clipping by norm or by value 156 | 157 | The following arguments should be supported on all optimizers (this only requires a single shared implementation in the base class): 158 | 159 | ```python 160 | Adadelta(clip_norm=0.) 161 | Adadelta(clip_value=0.) 162 | ``` 163 | 164 | 165 | ### VIII - Unify optimizer signatures across Keras and tf.train. 166 | 167 | Optimizers would live in `tf.keras.optimizers`. The old optimizers would remain in `tf.compat.v1`. 168 | 169 | The set of new optimizers would be: 170 | 171 | - SGD (aliased to GradientDescent; corresponds to both GradientDescentOptimizer and MomentumOptimizer) 172 | - Adadelta 173 | - Adagrad 174 | - Adam 175 | - FTRL (not yet in Keras) 176 | - RMSProp 177 | - Adamax (not yet in TF) 178 | - Nadam (not yet in TF) 179 | 180 | We will remove `ProximalGradientDescent` and `ProximalAdagrad` (they will stay in `tf.compat.v1`). They do not appear to be used by a critical mass of users. 181 | 182 | The implementation of these optimizers would be essentially the same as that of the current TF optimizers, with slight occasional changes to support new functionality (rare).
However, the signature of these optimizers would change significantly, as described below. There would also be changes in the core Keras API. These changes would be made fully backwards compatible via API conversion decorators (similar to what we did when we changed the Keras API from 1.0 to 2.0) and would be replicated in both tf.keras and external Keras. 183 | 184 | Signature details below. 185 | 186 | 187 | ### SGD 188 | 189 | Current TF signatures: 190 | 191 | ```Python 192 | GradientDescentOptimizer(learning_rate, use_locking=False, name="GradientDescent") 193 | MomentumOptimizer(learning_rate, momentum, use_locking=False, name="Momentum", use_nesterov=False) 194 | ``` 195 | 196 | Current Keras signature: 197 | 198 | ```Python 199 | keras.optimizers.SGD(lr=0.01, momentum=0.0, decay=0.0, nesterov=False) 200 | ``` 201 | 202 | Proposed signature: 203 | 204 | ```Python 205 | SGD(learning_rate=0.001, 206 | momentum=0.0, 207 | decay=0.0, 208 | nesterov=False, 209 | name='SGD') 210 | ``` 211 | 212 | **Notes:** 213 | - Optimizers should not require positional arguments, especially if some do and some don’t (like now), and especially if the set of required positional arguments changes from optimizer to optimizer. For the best UX, all arguments should have a reasonable default value. 214 | - The implementation of SGD with/without momentum is not sufficiently different to justify two distinct classes. A single SGD class provides a better UX. 215 | - Public API arguments should not be about internal implementation details that cannot be readily understood by users (e.g. `use_locking`). 216 | 217 | ### Adadelta 218 | 219 | Current TF signature: 220 | 221 | ```Python 222 | AdadeltaOptimizer(learning_rate=0.001, rho=0.95, epsilon=1e-8, 223 | use_locking=False, name="Adadelta") 224 | ``` 225 | 226 | Current Keras signature: 227 | 228 | ```Python 229 | Adadelta(lr=1.0, rho=0.95, epsilon=None, decay=0.0) 230 | ``` 231 | 232 | Proposed signature: 233 | 234 | ```Python 235 | Adadelta(learning_rate=0.001, 236 | rho=0.95, 237 | epsilon=1e-8, 238 | decay=0.0, 239 | name="Adadelta") 240 | ``` 241 | 242 | **Notes:** 243 | - `epsilon=None` in Keras means “use the global default value for the epsilon fuzz factor” (typically `1e-7`). Should we also keep this behavior in the new API or should we have explicit values in the signatures? This applies to all optimizers. 
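As an aside before the remaining signatures, here is a rough sketch of the variable-backed hyperparameter mechanism described in sections V and VI. This is hypothetical illustration code, not the final implementation; in particular, `OptimizerBase` and `_init_hyperparameter` are invented names:

```python
import tensorflow as tf

class OptimizerBase:
    """Sketch: hyperparameters backed by variables (sections V and VI)."""

    def _init_hyperparameter(self, name, value):
        # Backing each hyperparameter with a variable lets it be reassigned
        # later without reinstantiating the optimizer.
        object.__setattr__(self, '_hyper_' + name, tf.Variable(value, name=name))

    def __getattr__(self, name):
        # Called only when normal lookup fails, e.g. `optimizer.learning_rate`.
        hyper = self.__dict__.get('_hyper_' + name)
        if hyper is not None:
            return hyper
        raise AttributeError(name)

    def __setattr__(self, name, value):
        hyper = self.__dict__.get('_hyper_' + name)
        if hyper is not None:
            # In graph mode, the resulting assign op would be run in the
            # default session, per section VI.
            hyper.assign(value)
        else:
            object.__setattr__(self, name, value)

class Adadelta(OptimizerBase):
    def __init__(self, learning_rate=0.001, rho=0.95, epsilon=1e-8,
                 decay=0.0, name="Adadelta"):
        self.name = name
        for key, value in [('learning_rate', learning_rate), ('rho', rho),
                           ('epsilon', epsilon), ('decay', decay)]:
            self._init_hyperparameter(key, value)

opt = Adadelta(learning_rate=0.2)
print(float(opt.learning_rate))  # 0.2, read from the backing variable
opt.learning_rate = 0.1          # assigns in place; no re-instantiation needed
```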
244 | 245 | 246 | ### Adagrad 247 | 248 | Current TF signature: 249 | 250 | ```Python 251 | AdagradOptimizer(learning_rate, initial_accumulator_value=0.1, use_locking=False, name="Adagrad") 252 | ``` 253 | 254 | Current Keras signature: 255 | 256 | ```Python 257 | Adagrad(lr=0.01, epsilon=None, decay=0.0) 258 | ``` 259 | 260 | Proposed signature: 261 | 262 | ```Python 263 | Adagrad(learning_rate=0.001, 264 | epsilon=1e-8, 265 | decay=0.0, 266 | initial_accumulator_value=0.1, 267 | name="Adagrad") 268 | ``` 269 | 270 | ### Adam 271 | 272 | Current TF signature: 273 | 274 | ```Python 275 | AdamOptimizer(learning_rate=0.001, beta1=0.9, beta2=0.999, epsilon=1e-8, use_locking=False, name="Adam") 276 | ``` 277 | 278 | Current Keras signature: 279 | 280 | ```Python 281 | Adam(lr=0.001, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0, amsgrad=False) 282 | ``` 283 | 284 | Proposed signature: 285 | 286 | ```Python 287 | Adam(learning_rate=0.001, 288 | beta_1=0.9, 289 | beta_2=0.999, 290 | epsilon=1e-8, 291 | decay=0.0, 292 | amsgrad=False, 293 | name="Adam") 294 | ``` 295 | 296 | ### FTRL 297 | 298 | Current TF signature: 299 | 300 | ```Python 301 | FtrlOptimizer(learning_rate, 302 | learning_rate_power=-0.5, 303 | initial_accumulator_value=0.1, 304 | l1_regularization_strength=0.0, 305 | l2_regularization_strength=0.0, 306 | use_locking=False, 307 | name="Ftrl", 308 | accum_name=None, 309 | linear_name=None, 310 | l2_shrinkage_regularization_strength=0.0) 311 | ``` 312 | 313 | Proposed signature: 314 | 315 | ```Python 316 | FTRL(learning_rate, 317 | learning_rate_power=-0.5, 318 | initial_accumulator_value=0.1, 319 | l1_regularization_strength=0.0, 320 | l2_regularization_strength=0.0, 321 | name="FTRL", 322 | l2_shrinkage_regularization_strength=0.0) 323 | ``` 324 | 325 | 326 | ### RMSProp 327 | 328 | Current TF signature: 329 | 330 | ```Python 331 | RMSPropOptimizer(learning_rate, 332 | decay=0.9, 333 | momentum=0.0, 334 | epsilon=1e-10, 335 | use_locking=False, 336 | centered=False, 337 | name="RMSProp") 338 | ``` 339 | 340 | Current Keras signature: 341 | 342 | ```Python 343 | RMSprop(lr=0.001, rho=0.9, epsilon=None, decay=0.0) 344 | ``` 345 | 346 | Proposed signature: 347 | 348 | ```Python 349 | RMSProp(learning_rate=0.001, 350 | rho=0.9, 351 | epsilon=1e-8, 352 | decay=0.0, 353 | centered=False, 354 | name="RMSProp") 355 | ``` 356 | 357 | **Notes:** 358 | - The `rho` argument was named `decay` in TF. The `decay` argument is a standard argument on all adaptive learning-rate optimizers. 359 | 360 | 361 | ### Adamax 362 | 363 | Current Keras signature: 364 | 365 | ```Python 366 | Adamax(lr=0.002, beta_1=0.9, beta_2=0.999, epsilon=None, decay=0.0) 367 | ``` 368 | 369 | Proposed signature: 370 | 371 | ```Python 372 | Adamax(learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-8, decay=0.0, name="Adamax") 373 | ``` 374 | 375 | 376 | ### Nadam 377 | 378 | Current Keras signature: 379 | 380 | ```Python 381 | Nadam(lr=0.002, beta_1=0.9, beta_2=0.999, epsilon=None, schedule_decay=0.004) 382 | ``` 383 | 384 | Proposed signature: 385 | 386 | ```Python 387 | Nadam(learning_rate=0.001, beta_1=0.9, beta_2=0.999, epsilon=1e-8, decay=0.0, name="Nadam") 388 | ``` 389 | 390 | 391 | 392 | --- 393 | 394 | ## Questions and Discussion Topics 395 | 396 | - Do you have a use case where you need to reuse an optimizer across different sets of weights? (note: this will still be doable with this proposal) Describe your use case. 397 | - Do you use the `centered` or `initial_accumulator_value` arguments? 
398 | - Do you use the `use_locking` argument? Describe your use case. 399 | -------------------------------------------------------------------------------- /rfcs/20181214-move-to-addons.md: -------------------------------------------------------------------------------- 1 | # Move from tf.contrib to tensorflow/addons 2 | 3 | | Status | Accepted | 4 | :-------------- |:---------------------------------------------------- | 5 | | **Author(s)** | Sean Morgan (seanmorgan@outlook.com), Armando Fandango (armando@neurasights.com) | 6 | | **Sponsor** | Karmel Allison (karmel@google.com) | 7 | | **Updated** | 2018-12-16 | 8 | 9 | ## Objective 10 | 11 | With the upcoming removal of tf.contrib in TF 2.0, we are in the process 12 | of deciding what existing functionality will be moved to and maintained in 13 | the [tensorflow/addons](https://github.com/tensorflow/addons) 14 | repository. 15 | 16 | This document details what functionality the SIG plans to move and 17 | invites discussion around those decisions. 18 | 19 | 20 | ## Motivation 21 | 22 | In this RFC, we are soliciting discussion regarding what tf.contrib code 23 | should be moved to tensorflow/addons. This RFC discussion will help us 24 | determine the value of the code being moved and its 25 | maintainability. 26 | 27 | ## Design Proposal 28 | 29 | ### Criteria for moving 30 | 1) The functionality is not otherwise available in TensorFlow 31 | 1) There is sufficient interest in the community to maintain the code being moved 32 | 1) The code conforms to an established API pattern (some pieces can be refactored if needed) 33 | 34 | It is worth noting that just because some functionality isn't part of 35 | the initial move does not mean it won't eventually be part of addons 36 | if there is value. We will begin reviewing pull requests to the 37 | repository after the directory structure is shaped during the initial move.
38 | 39 | ### Code to be moved from tf.contrib to addons 40 | 41 | | Module (tf.contrib) | Class/Function | Rationale | 42 | |:----------------------- |:----------- |:------------------------------------ | 43 | | opt.external_optimizer | ExternalOptimizerInterface | Base class for external optimizers used in OSS projects | 44 | | opt.external_optimizer | ScipyOptimizerInterface | Significant usage in OSS projects | 45 | | opt.lazy_adam_optimizer | LazyAdamOptimizer | Significant usage in OSS projects / discussions | 46 | | opt.moving_average_optimizer | MovingAverageOptimizer | Significant usage in OSS projects | 47 | | layers.layers | dense_to_sparse | Useful functionality and discussion around it | 48 | | layers.layers | layer_norm | Heavily used in OSS projects / From impactful paper | 49 | | layers.layers | maxout | From impactful paper | 50 | | layers.layers | poincare_normalize | Functionality not available / Useful for hyperbolic embeddings | 51 | | layers.normalization | instance_norm | Heavily used in OSS projects / Used for style xfer | 52 | | layers.normalization | group_norm | Will be moved as a generalized case of layer_norm and instance_norm | 53 | | losses.metric_loss_ops | pairwise_distance | Useful functionality not otherwise available | 54 | | losses.metric_loss_ops | contrastive_loss | Useful functionality not otherwise available | 55 | | losses.metric_loss_ops | masked_maximum | Useful functionality not otherwise available | 56 | | losses.metric_loss_ops | masked_minimum | Useful functionality not otherwise available | 57 | | losses.metric_loss_ops | triplet_semihard_loss | Useful functionality not otherwise available / From impactful paper | 58 | | losses.metric_loss_ops | npairs_loss | Useful functionality not otherwise available | 59 | | losses.metric_loss_ops | npairs_loss_multilabel | Useful functionality not otherwise available | 60 | | losses.metric_loss_ops | lifted_struct_loss | Useful functionality not otherwise available | 61 | | sparsemax.sparsemax | ALL | Useful functionality not otherwise available / Volunteers to maintain | 62 | | image.dense_image_warp | dense_image_warp | Useful functionality not otherwise available | 63 | | image.distort_image_ops | random_hsv_in_yiq | Useful functionality not otherwise available | 64 | | image.distort_image_ops | adjust_hsv_in_yiq | Useful functionality not otherwise available | 65 | | image.image_ops | rotate | Useful functionality not otherwise available / Several uses in OSS found | 66 | | image.image_ops | translate | Useful functionality not otherwise available | 67 | | image.image_ops | angles_to_projective_transforms | Useful functionality not otherwise available / Several uses in OSS found | 68 | | image.image_ops | translations_to_projective_transforms | Useful functionality not otherwise available | 69 | | image.image_ops | transform | Useful functionality not otherwise available / Several uses in OSS found | 70 | | image.image_ops | compose_transforms | Useful functionality not otherwise available / Several uses in OSS found | 71 | | image.image_ops | flat_transforms_to_matrices | Helper util used a few times in module | 72 | | image.image_ops | matrices_to_flat_transforms | Helper util used a few times in module | 73 | | image.image_ops | connected_components | Useful functionality not otherwise available | 74 | | text.skip_gram_ops | ALL | Useful functionality not otherwise available | 75 | | crf.crf | ALL | Heavily used by the NLP community | 76 | | opt.weight_decay_optimizers |
DecoupledWeightDecayExtension | ~SOTA convergence speeds / Needs refactoring as Wrapper subclass | 77 | | opt.weight_decay_optimizers | AdamWOptimizer | ~SOTA convergence speeds / Needs refactoring as wrapper + keras Adam | 78 | | opt.weight_decay_optimizers | MomentumWOptimizer | ~SOTA convergence speeds / Needs refactoring as wrapper + keras SGD | 79 | 80 | ### Code that will not be moved from tf.contrib pending objections 81 | 82 | | Module (tf.contrib) | Class/Function | Rationale | 83 | |:----------------------- |:----------- |:------------------------------------ | 84 | | opt.addsign | AddSignOptimizer | No OSS uses found / Needs refactoring as OptimizerV2 subclass | 85 | | opt.agn_optimizer | AGNOptimizer | No OSS uses found / Needs refactoring as OptimizerV2 subclass | 86 | | opt.drop_stale_gradient_optimizer | DropStaleGradientOptimizer | No OSS uses found / Needs refactoring as Wrapper subclass | 87 | | opt.elastic_average_optimizer | ElasticAverageOptimizer | No OSS uses found / Needs refactoring as Wrapper subclass | 88 | | opt.ggt | GGTOptimizer | No OSS uses found | 89 | | opt.lars_optimizer | LARSOptimizer | No OSS uses found / Needs refactoring as OptimizerV2 subclass | 90 | | opt.shampoo | ShampooOptimizer | No OSS uses found / Needs refactoring as OptimizerV2 subclass | 91 | | opt.matrix_functions | matrix_inverse_pth_root | Used in opt.shampoo | 92 | | opt.model_average_optimizer | ModelAverageOptimizer | No OSS uses found / Needs refactoring as Wrapper subclass | 93 | | opt.multitask_optimizer_wrapper | MultitaskOptimizerWrapper | No OSS uses found / Needs refactoring as Wrapper subclass | 94 | | opt.multitask_optimizer_wrapper | clip_gradients_by_global_norm | No OSS uses found / Specific to MultitaskOptimizers / At least partly covered in Keras optimizer | 95 | | opt.powersign | PowerSignOptimizer | No OSS uses found / Needs refactoring as OptimizerV2 subclass | 96 | | opt.sign_decay | get_linear_decay_fn | No OSS usage / Used in AddSign & PowerSign | 97 | | opt.sign_decay | get_cosine_decay_fn | No OSS usage / Not an optimizer | 98 | | opt.sign_decay | get_restart_decay_fn | No OSS usage / Not an optimizer | 99 | | opt.reg_adagrad_optimizer | RegAdagradOptimizer | No OSS uses found / Needs refactoring as keras Adagrad subclass | 100 | | opt.variable_clipping_optimizer | VariableClippingOptimizer | No OSS uses found / Needs refactoring as Wrapper subclass / partially covered by keras norm clip | 101 | | opt.weight_decay_optimizers | ShampooWOptimizer | No OSS uses found | 102 | | opt.weight_decay_optimizers | extend_with_decoupled_weight_decay | No OSS uses found / Functional paradigm - factory function | 103 | | layers.embedding_ops | scattered_embedding_lookup_sparse | No OSS uses found | 104 | | layers.embedding_ops | embedding_lookup_unique | No OSS uses found | 105 | | layers.encoders | bow_encoder | Creates variables, but does not subclass Layer | 106 | | layers.encoders | embed_sequence | Creates variables, but does not subclass Layer | 107 | | layers.layers | convolution2d_in_plane | No OSS uses found | 108 | | layers.layers | GDN | No OSS uses found | 109 | | layers.layers | scale_gradient | No OSS uses found | 110 | | layers.layers | sequence_to_images | No OSS uses found | 111 | | layers.layers | spatial_softmax | One OSS project found / Needs refactoring as base Layer subclass / Uses get_variable_collections | 112 | | layers.optimizers | optimize_loss | Convenience wrapper to build a training op / Would need a refactor to stick to TF 2.0 APIs | 113 | | layers.optimizers |
adaptive_clipping_fn | No OSS uses found | 114 | | layers.rev_block_lib | RevBlock | No OSS uses found | 115 | | layers.rev_block_lib | recompute_grad | No OSS uses found | 116 | | layers.summaries | summarize_tensor | One OSS project found / Very simple wrapper | 117 | | layers.utils | constant_value | Simple wrapper... need a good reason to support | 118 | | nn.alpha_dropout | alpha_dropout | No OSS uses found / Needs refactoring as base Layer subclass | 119 | | nn.fwd_gradients | fwd_gradients | No OSS uses found | 120 | | nn.sampling_ops | rank_sampled_softmax_loss | One OSS use found / Needs to utilize sampled_softmax_loss_v2 | 121 | | nn.sampling_ops | sampled_sparse_softmax_loss | No OSS uses found / Needs to utilize sampled_softmax_loss_v2 | 122 | | nn.scaled_softplus | scaled_softplus | No OSS uses found | 123 | | losses.metric_loss_ops | update_1d_tensor | No OSS uses found / Large amount of code related to cluster_loss | 124 | | losses.metric_loss_ops | get_cluster_assignment | No OSS uses found / Large amount of code related to cluster_loss | 125 | | losses.metric_loss_ops | compute_facility_energy | No OSS uses found / Large amount of code related to cluster_loss | 126 | | losses.metric_loss_ops | compute_clustering_score | No OSS uses found / Large amount of code related to cluster_loss | 127 | | losses.metric_loss_ops | compute_augmented_facility_locations | No OSS uses found / Large amount of code related to cluster_loss | 128 | | losses.metric_loss_ops | update_medoid_per_cluster | No OSS uses found / Large amount of code related to cluster_loss | 129 | | losses.metric_loss_ops | update_all_medoids | No OSS uses found / Large amount of code related to cluster_loss | 130 | | losses.metric_loss_ops | compute_augmented_facility_locations_pam | No OSS uses found / Large amount of code related to cluster_loss | 131 | | losses.metric_loss_ops | compute_gt_cluster_score | No OSS uses found / Large amount of code related to cluster_loss | 132 | | losses.metric_loss_ops | cluster_loss | No OSS uses found / Large amount of code related to cluster_loss | 133 | | image.image_ops | bipartite_match | No OSS uses found / Should live in linalg or somewhere else? | 134 | | image.interpolate_spline | interpolate_spline | One OSS use found / Should live in tf.signal?
| 135 | | image.single_image_random_dot_stereograms | single_image_random_dot_stereograms | No OSS uses found | 136 | | image.sparse_image_warp | sparse_image_warp | No OSS uses found | 137 | | resampler.resampler_ops | ALL | Pending community interest | 138 | | solvers | ALL | Pending community interest to maintain | 139 | | integrate | ALL | Pending community interest to maintain | 140 | 141 | ### Code that will not be copied from tf.contrib to addons and hence would not be available in either tf.contrib or addons 142 | 143 | | Module (tf.contrib) | Class/Function | Rationale | 144 | |:----------------------- |:----------- |:------------------------------------ | 145 | | opt.adamax | AdaMaxOptimizer | Available in tf.keras.optimizers | 146 | | opt.matrix_functions | matrix_square_root | Available as linalg_ops.matrix_square_root | 147 | | opt.nadam_optimizer | NadamOptimizer | Available in tf.keras.optimizers | 148 | | layers.embedding_ops | safe_embedding_lookup_sparse | Exists as tf.nn.safe_embedding_lookup_sparse | 149 | | layers.embedding_ops | embedding_lookup_sparse_with_distributed_aggregation | Replaced by embedding_lookup_sparse_v2 | 150 | | layers.feature_column | ALL | Better version available in tf.feature_column | 151 | | layers.initializers | xavier_initializer | tf.keras has a glorot_normal and glorot_uniform | 152 | | layers.initializers | variance_scaling_initializer | Exists in tf.keras.initializers | 153 | | layers.layers | avg_pool2d | Exists in tf.keras.layers | 154 | | layers.layers | avg_pool3d | Exists in tf.keras.layers | 155 | | layers.layers | batch_norm | Exists in tf.keras.layers | 156 | | layers.layers | bias_add | Exists in tf.keras.layers | 157 | | layers.layers | conv1d | Exists in tf.keras.layers | 158 | | layers.layers | conv2d | Exists in tf.keras.layers | 159 | | layers.layers | conv3d | Exists in tf.keras.layers | 160 | | layers.layers | conv2d_in_plane | Functional Alias | 161 | | layers.layers | conv2d_transpose | Exists in tf.keras.layers | 162 | | layers.layers | conv3d_transpose | Exists in tf.keras.layers | 163 | | layers.layers | convolution | Exists in tf.keras.layers | 164 | | layers.layers | convolution1d | Exists in tf.keras.layers | 165 | | layers.layers | convolution2d | Exists in tf.keras.layers | 166 | | layers.layers | convolution2d_transpose | Exists in tf.keras.layers | 167 | | layers.layers | convolution3d | Exists in tf.keras.layers | 168 | | layers.layers | convolution3d_transpose | Exists in tf.keras.layers | 169 | | layers.layers | dropout | Exists in tf.keras.layers | 170 | | layers.layers | elu | Exists in tf.keras.layers | 171 | | layers.layers | flatten | Exists in tf.keras.layers | 172 | | layers.layers | fully_connected | Exists in tf.keras.layers | 173 | | layers.layers | gdn | Functional interface of GDN | 174 | | layers.layers | images_to_sequence | No OSS uses found / Functional paradigm | 175 | | layers.layers | linear | Exists in tf.keras.layers | 176 | | layers.layers | pool | Exists in tf.keras.layers | 177 | | layers.layers | max_pool2d | Exists in tf.keras.layers | 178 | | layers.layers | max_pool3d | Exists in tf.keras.layers | 179 | | layers.layers | one_hot_encoding | Exists in tf.keras / Uses collections | 180 | | layers.layers | relu | Exists in tf.keras.layers | 181 | | layers.layers | relu6 | Exists in tf.keras.layers | 182 | | layers.layers | repeat | Exists as sequential model | 183 | | layers.layers | separable_conv2d | Exists in tf.keras.layers | 184 | | layers.layers | separable_convolution2d |
Exists in tf.keras.layers | 185 | | layers.layers | softmax | Exists in tf.keras.layers | 186 | | layers.layers | stack | Exists as sequential model / Uses variable scoping | 187 | | layers.layers | unit_norm | Exists in linalg | 188 | | layers.layers | legacy_fully_connected | Legacy layer | 189 | | layers.layers | legacy_linear | Legacy layer | 190 | | layers.layers | legacy_relu | Legacy layer | 191 | | layers.regularizers | l1_regularizer | Available in tf.keras.regularizers | 192 | | layers.regularizers | l2_regularizer | Available in tf.keras.regularizers | 193 | | layers.regularizers | l1_l2_regularizer | Available in tf.keras.regularizers | 194 | | layers.regularizers | sum_regularizer | Trivial convenience wrapper | 195 | | layers.regularizers | apply_regularization | Uses collections | 196 | | layers.rev_block_lib | rev_block | Functional paradigm for RevBlock | 197 | | layers.summaries | summarize_tensors | Trivial list comprehension | 198 | | layers.summaries | summarize_collection | Uses collections | 199 | | layers.summaries | summarize_activations | Uses collections | 200 | | layers.target_column | ALL | Deprecated since Estimators | 201 | | layers.utils | collect_named_output | Unsupported tensor alias API | 202 | | layers.utils | append_tensor_alias | Unsupported tensor alias API | 203 | | layers.utils | gather_tensors_aliases | Unsupported tensor alias API | 204 | | layers.utils | get_tensor_aliases | Unsupported tensor alias API | 205 | | layers.utils | convert_collection_to_dict | Uses collections | 206 | | layers.utils | static_cond | Simple wrapper / No OSS use | 207 | | layers.utils | smart_cond | Simple wrapper / Little OSS use | 208 | | layers.utils | get_variable_collections | Uses collections | 209 | | layers.utils | channel_dimension | Simple wrapper / No OSS use | 210 | | layers.utils | last_dimension | Simple wrapper / No OSS use | 211 | | layers.utils | two_element_tuple | No OSS use | 212 | | layers.utils | n_positive_integers | No OSS use | 213 | | nn.cross_entropy | ALL | Deprecated Losses | 214 | | losses.loss_ops | ALL | Available in core tf.losses | 215 | 216 | 217 | 218 | **Notes:** 219 | * More details of our code review can be found in [this spreadsheet](https://docs.google.com/spreadsheets/d/1hYJchHp1y1t2U6htq5UXxMxWlGxxtOyyNHDF8_qhtQQ/edit#gid=185512613) 220 | * We used [this analysis tool](https://tf-contrib-analyzer.herokuapp.com/) to detect OSS usage. 221 | 222 | ## Questions and Discussion Topics 223 | 224 | * Are there any modules being excluded from the move that you feel have substantial value to the community? 225 | * Are there any new modules that you feel should be added to addons from somewhere other than tf.contrib? 226 | * We're actively collecting volunteers to help move, refactor, and/or maintain code in Addons (please reach out to our [mailing list](https://groups.google.com/a/tensorflow.org/forum/#!forum/addons) 227 | or [gitter channel](https://gitter.im/tensorflow/sig-addons) if you have interest in helping our community).
228 | 229 | ## After Request Notes 230 | * Now that the review period has ended, please post all suggested 231 | additions/removals directly to the tensorflow/addons [issues page](https://github.com/tensorflow/addons/issues) -------------------------------------------------------------------------------- /rfcs/20181217-tf2-random-numbers.md: -------------------------------------------------------------------------------- 1 | # Random numbers in TensorFlow 2.0 2 | 3 | | Status | Accepted | 4 | :-------------- |:---------------------------------------------------- | 5 | | **Author(s)** | Peng Wang (wangpeng@google.com), Josh Levenberg (joshl@google.com), Alexandre Passos (apassos@google.com), Asim Shankar (ashankar@google.com) | 6 | | **Sponsor** | Josh Levenberg (joshl@google.com), Alexandre Passos (apassos@google.com) | 7 | | **Updated** | 2019-01-30 | 8 | 9 | ## Objective 10 | 11 | We'd like to revamp the random number facilities in TensorFlow 2.0. 12 | 13 | * Replace the current stateful random ops, which keep state in the C++ OpKernel instance. For 2.0, all state should be moved into Resources, where it can be checkpointed, used to sequence access to the same state, managed when executing eagerly, etc. 14 | * Use stateless random ops where possible, to improve reproducibility and simplicity. For example, variable initializers should switch to stateless random ops, so that saving the initialization graph allows you to reproduce the same initial values. 15 | * Improve reproducibility 16 | * Random state is checkpointed by default. 17 | * Seeding isn't as sensitive to how many ops have been created in the graph so far. 18 | * Code written with eager execution should produce the same sequence when you switch to graph execution using `tf.function`. 19 | * Options for regenerating random tensors from a small amount of state. For example, dropout needs the large mask tensor used in the forward pass available in the backward pass, but we'd prefer not to hold on to it, tying up memory, in between. 20 | * We should switch to using the same RNG algorithm across devices, where possible. 21 | * We should reset the op seed any time we reset the global seed, to address [GitHub issue 9171](https://github.com/tensorflow/tensorflow/issues/9171). 22 | * Give the user greater control over the RNG algorithm used, to be able to select some combination of: 23 | * the same sequence across many different accelerator types 24 | * a fast implementation for a specific kind of accelerator 25 | * RNG strength (lack of observable regularities in the output) 26 | 27 | ## Motivation 28 | 29 | Switching how we do random numbers is going to break a lot of tests. We should do this once. Some of the changes are likely going to be API changes that can only happen at a major version transition, and we'd prefer to get them into 2.0 instead of waiting for 3.0. The current solution relies on using the current graph's op count, which is less unique when creating a new graph for each `tf.function`. 30 | 31 | ## Background 32 | 33 | We currently have: 34 | 35 | * `tf.set_random_seed()` to set a "graph" seed. This is currently global to a graph. This will become a "global" seed in 2.0 due to the migration away from graphs. 36 | * A bunch of stateful random ops (like `tf.random_uniform()`) that explicitly take an optional "op" seed and implicitly take the graph/global seed as attrs. If both of these seeds are zero, the kernel generates seeds nondeterministically.
The Python layer ensures that both seeds are zero only when neither the graph/global seed nor the op seed is specified (both are `None`). State is kept in the C++ kernel instance, so repeated executions of the kernel return different results. 37 | * A set of stateless random ops that have recently been moved from contrib to core. These take two seeds as input (as tensors, not attrs), but always produce the same output given the same input. 38 | 39 | The contract and implementation for the stateful ops are: 40 | 41 | * If you specify either the global or op seed (i.e. at least one is not `None`), then you get deterministic/reproducible behavior. 42 | * If you specify the global seed but not the op seed, different ops get different seeds, but everything is still deterministic/reproducible. [Currently](https://github.com/tensorflow/tensorflow/blob/00d91e7bc3111b00c2e679627362ec21dab64833/tensorflow/python/framework/random_seed.py#L39) this is generated using the count of ops in the current graph in graph-construction mode, and a pseudo-random sequence when executing eagerly. This pseudo-random sequence depends on a seed and the number of random ops executed (not all ops), see [Context.internal_operation_seed](https://github.com/tensorflow/tensorflow/blob/a3d634438e9cc70073faa796018b6173212e2f85/tensorflow/python/eager/context.py#L279). 43 | * If you specify neither the global nor the op seed (both are `None`), you get different random sequences every time, including different results if you restart the program without changing anything. Currently this is implemented by passing zero to both seed attrs to the kernel, which the kernel treats as a special case. If you set either the global or op seed, we make sure never to pass 0, 0 to the kernel, even if you say `tf.set_random_seed(0)`. 44 | * If you specify just the op seed, we use [DEFAULT_GLOBAL_SEED](https://github.com/tensorflow/tensorflow/blob/3eb7616b5459aec3dabaa4152a00de14a1fa0914/tensorflow/python/framework/random_seed.py#L29) for the global seed so you get deterministic behavior. 45 | 46 | ## Design Proposal 47 | 48 | The following represents the desired end-state, and doesn't go into detail about transitioning from our current stateful ops: 49 | 50 | ```python 51 | # random.py 52 | 53 | # A seed for random ops (stateful and stateless) will always be 1024 54 | # bits, all of which will be sent to the C++ code. The actual C++ 55 | # implementation of some algorithms may only use a lower part of the bits. 56 | # *QUESTION*: Is 1024 a good number? 57 | # *DECISION*: Yes. 58 | 59 | @tf_export("random.non_deterministic_seed") 60 | def non_deterministic_seed(): # returns an integer 61 | # *QUESTION*: Is this pure Python or an op? 62 | # *DECISION*: Op. 63 | 64 | # *QUESTION*: Should this be public? 65 | # *DECISION*: Yes. 66 | # *QUESTION*: Should this function be usable inside tf.function? 67 | # *DECISION*: Yes. 68 | @tf_export("random.create_rng_state") 69 | def create_rng_state(seed, algorithm): 70 | # seed must be an integer or stateless seed, never None. 71 | # Returns a 1-D tensor whose size depends on the algorithm. 72 | 73 | @tf_export("random.Generator") 74 | class Generator(Checkpointable): 75 | 76 | # *QUESTION*: Should this function be usable inside tf.function? 77 | # *DECISION*: Yes. 78 | def __init__(self, copy_from=None, seed=None, algorithm=None): 79 | if copy_from is None: 80 | if seed is None: 81 | seed = non_deterministic_seed() 82 | if algorithm is None: 83 | algorithm = ...
# auto-select 84 | self._state_var = tf.Variable(create_rng_state(seed, algorithm)) 85 | self._alg_var = tf.Variable(algorithm) 86 | else: 87 | assert seed is None 88 | self._state_var = tf.Variable(copy_from.state) 89 | self._alg_var = tf.Variable(copy_from.algorithm) 90 | 91 | # *QUESTION*: Should this function be usable inside tf.function? 92 | # *DECISION*: Yes. 93 | def reset(self, seed): 94 | # Will be able to also change algorithm in the future 95 | state = create_rng_state(seed, self.algorithm) 96 | self._state_var.assign(state) 97 | 98 | @property 99 | def state(self): 100 | return self._state_var 101 | 102 | @property 103 | def algorithm(self): 104 | return self._alg_var 105 | 106 | # The following functions return a tensor and as a side effect update 107 | # self._state_var. 108 | def uniform(self, shape, minval=0, maxval=None, dtype=tf.float32, name=None): 109 | def normal(self, shape, mean=0.0, stddev=1.0, dtype=tf.float32, name=None): 110 | def make_seeds(self, shape=()): # generates seeds for stateless random ops 111 | def make_generators(self, count=1, name=None): 112 | # Returns a list of `count` independent `Generator` objects 113 | # ... 114 | # How to use `Generator` with distribution strategies: 115 | # - If the generator is created outside of the distributed portion, no 116 | # special treatment is needed. 117 | # - If the generator is created within the distributed portion, its 118 | # variables always get mirrored. 119 | # - If you want per-replica unsynced generators, you need to explicitly 120 | # create the generators (where len(generators)==len(replicas)) and send 121 | # them to the replicas via the `args` argument of 122 | # `DistributionStrategyExtended.call_for_each_replica`. 123 | 124 | global_generator = Generator() 125 | 126 | # This function discards the old Generator object (and the variables within), 127 | # which may be problematic with tf.function because the old object may be 128 | # captured by a 'tf.function'ed function and still be used by it. 129 | # A 'tf.function'ed function only keeps weak references to variables, 130 | # so deleting a variable and then calling that function again may raise an 131 | # error. 132 | @tf_export("random.set_global_generator") 133 | def set_global_generator(generator): 134 | global global_generator 135 | global_generator = generator 136 | 137 | @tf_export("random.get_global_generator") 138 | def get_global_generator(): 139 | return global_generator 140 | 141 | @tf_export("random.default_algorithm") 142 | def default_algorithm(): 143 | 144 | @tf_export("random.algorithms_for_device") 145 | def algorithms_for_device(device_type): 146 | """Returns a sequence of (algorithm, speed, strength) tuples.""" 147 | # Maybe run an op on that device to ask it 148 | 149 | @tf_export("random.algorithms_supported_on_all_devices") 150 | def algorithms_supported_on_all_devices(): 151 | # Pick some algorithms that we can then require all devices implement 152 | 153 | def make_seed_if_none(op_seed): 154 | global global_generator 155 | if op_seed is None: 156 | return global_generator.make_seeds() 157 | return op_seed 158 | 159 | @tf_export("initializer.random_uniform") 160 | class RandomUniform(Initializer): 161 | """Initializer that generates tensors with a uniform distribution...""" 162 | 163 | def __init__(self, minval=0, maxval=None, seed=None, dtype=dtypes.float32, 164 | algorithm=None): 165 | ...
# unchanged, except for the addition of `algorithm`: 166 | if algorithm is None: 167 | algorithm = default_algorithm() 168 | self.algorithm = algorithm 169 | 170 | def __call__(self, shape, dtype=None, partition_info=None): 171 | if dtype is None: 172 | dtype = self.dtype 173 | return stateless_random_ops.stateless_random_uniform( 174 | shape, make_seed_if_none(self.seed), self.minval, self.maxval, dtype, 175 | self.algorithm) 176 | ``` 177 | 178 | We would also remove the stateful random ops from the public 2.0 API, replacing them with the stateless versions or the `tf.random.Generator` above. 179 | 180 | This pretty well achieves our objectives: 181 | 182 | * `tf.random.Generator` keeps its state in resource variables: 183 | * the Python object owns the state 184 | * can be checkpointed, etc. 185 | * Uses stateless random ops in the random initializers. The stateless seed will be a constant if the `seed` argument to the initializer is set to a non-`None` value. Otherwise it will depend on the value produced by the global op RNG. 186 | * `tf.random.Generator`, whether used for op seed generation or used directly, should work the same in graph and eager execution. 187 | * Seeding of individual ops without an op seed is dependent on the number of calls to `tf.random.make_seed_if_none()`, not the number of ops in the graph. 188 | * `tf.random.Generator`'s state may be copied to another `Generator`. 189 | * Calling `tf.random.set_seed()` reinitializes the sequence of op seeds, addressing [GitHub issue 9171](https://github.com/tensorflow/tensorflow/issues/9171). 190 | * Switching to new RNG APIs is an opportunity to switch to a different RNG algorithm that can be efficiently implemented on both TPUs and GPUs. We include a number identifying the algorithm being used in the RNG state so we can be sure that different devices agree on which algorithm to use, or raise an error. 191 | * Symbols moved to the `tf.random` namespace. 192 | * Additional features, like batch seeds for the stateless random ops, to address DeepMind use cases. 193 | 194 | ## Questions and Discussion Topics 195 | 196 | * There is another design where there is a global variable called `global_seed`. Initializers will use it together with the op seed to determine the seed sent to the stateless random ops. The affected change is: 197 | ```python 198 | global_seed = None 199 | global_generator = Generator(seed=global_seed) 200 | DEFAULT_GLOBAL_SEED = 87654321 201 | 202 | @tf_export("random.set_seed") 203 | def set_seed(seed, algorithm=None): 204 | # reset the global seed and the global generator 205 | global global_seed, global_generator 206 | global_seed = seed 207 | if algorithm is None: 208 | algorithm = global_generator.algorithm 209 | global_generator = Generator(seed=seed, algorithm=algorithm) 210 | 211 | def _combine_seeds(global_seed, op_seed): 212 | # combines global_seed and op_seed into a seed for stateless random ops 213 | return tf.stack([global_seed, op_seed]) 214 | 215 | @tf_export("random.make_seed_if_none") 216 | def make_seed_if_none(op_seed): 217 | global global_seed, global_generator 218 | if op_seed is None: 219 | return global_generator.make_seeds() 220 | if global_seed is None: 221 | return _combine_seeds(DEFAULT_GLOBAL_SEED, op_seed) 222 | return _combine_seeds(global_seed, op_seed) 223 | ``` 224 | The motivation is to preserve the design in TensorFlow 1.x, which uses a global seed and an op seed. Do we want `global_seed`? 225 | * Decision: No need for `global_seed`.
226 | * The `RandomUniform` implementation shown above has the behavior that when `seed` is not `None`, multiple `__call__` invocations return the same result. This has the advantage that it makes it easy to initialize two layers the same way when you want to, and the downside that it makes it easy to accidentally initialize two layers the same way. An alternative implementation is that when `seed` is not `None`, `RandomUniform` creates a `Generator` instance from `seed`, stores it as a member, and draws samples from it. In this way, multiple `__call__` invocations return different results, but we can use `seed` to get determinism. Which of the two semantics do we want? 227 | * Decision: The first semantics (always return the same sequence when seeded). 228 | 229 | ## Design Review Notes 230 | 231 | 2019-01-17 232 | 233 | * Question: Differences between the CL with the implementation and the GitHub RFC? 234 | * The CL matches the RFC 235 | * All 4 Questions asked by the RFC now have answers (counting all `tf.function`-related questions as 1). 236 | * Minor questions (e.g. naming) newly raised as a result of the RFC have been responded to. 237 | * New big question: device placement. 238 | * Seed size: No one objects to 1024; the TF Probability team wants >= 256. 239 | * No provision for ever raising the limit, but we don't see a fundamental reason we can't use larger tensors later. 240 | * State size is separate, algorithm specific, not fixed at 1024 bits. 241 | * Question: Algorithm + state bundled together? 242 | * Makes it easier to have a single thing for reproducing a sequence. 243 | * Question: But what about changing the algorithm? 244 | * Should be supported by `Generator.reset`. Currently we have bugs related to changing the size of the variable (to match an algorithm's state size). 245 | * Decision: Using ops where there is a question, which means being compatible with `tf.function`. 246 | * Decision: If you use an initializer with a specified seed, you should get the same model if you reinitialize; if you leave the seed unspecified, you get a different initialization each time. 247 | * Note: we have replaced the old global seed with a global generator. 248 | * New big question: we used to assume that the global generator is on one device. How do we handle models on multiple devices? 249 | * We could allow communication to the single device to get random numbers, but it's slow and has high latency. 250 | * There are a couple of different ways of having one variable per device: either having multiple variables per generator (lazily adding them as you access the generator from new devices), or having multiple generators, one per device (one variable each) (here we are treating `_state_var` and `_alg_var` as one variable). 251 | * Question: Regarding determinism of splitting, can we say something about the sequence you get from a seed? 252 | * Decision: require explicit splitting (i.e. `Generator.make_generators`) until we have need for an automatic solution. 253 | * Question: Should the input pipeline use these random numbers? 254 | * Ex. `tf.data.Dataset.list_files` is not currently affected by this proposal. 255 | * A problem right now with the 1.x RNG ops: they behave differently for dynamic RNN vs. unrolling 256 | * Question: Interaction with `tf.distribute.Strategy`; you will get a mirrored variable if you use `MirroredStrategy`. 257 | * Probably bad for GANs. 258 | * Question: Checkpointing the mirrored state? 259 | * Checkpointing/reviving synced mirrored state is easy.
Checkpointing/reviving unsynced per-replica states is hard. 260 | * Suggestion: Require an explicit split if you are going to use random numbers in the training step, where you explicitly specify whether the generators you are using on each device should be in sync. Have an API for things like the dropout layer: "Give me a generator and it should be (synced/unsynced) across replicas." 261 | * Expectation is that users like `tf.probability` are fine being explicit and generally want the control. 262 | * Hopefully the decision to be explicit will make checkpointing straightforward; the harder case is unsynced across replicas -- what to do if the list of devices changes? 263 | * Word from [Allen](https://github.com/allenlavoie): we will at least get an error if the set of variables changes. 264 | --------------------------------------------------------------------------------