├── .gitignore ├── Makefile ├── README.md ├── _config.yml ├── _requests_for_research ├── better-sample-efficiency-for-trpo.html ├── cartpole.html ├── description2code.html ├── difference-of-value-functions.html ├── funnybot.html ├── im2latex.html ├── improved-q-learning-with-continuous-actions.html ├── infinite-symbolic-generalization.html ├── inverse-draw.html ├── multiobjective-rl.html ├── multitask-rl-with-continuous-actions.html ├── natural-q-learning.html ├── parallel-trpo.html └── q-learning-on-the-ram-variant-of-atari.html └── index.html /.gitignore: -------------------------------------------------------------------------------- 1 | _site 2 | *~ 3 | -------------------------------------------------------------------------------- /Makefile: -------------------------------------------------------------------------------- 1 | serve: 2 | jekyll serve -w -P 4001 3 | -------------------------------------------------------------------------------- /README.md: -------------------------------------------------------------------------------- 1 | **Status:** Archive (code is provided as-is, no updates expected) 2 | 3 | # Requests for Research 4 | 5 | It's easy to get started in deep learning, with many 6 | [resources](https://www.quora.com/What-are-the-best-ways-to-pick-up-Deep-Learning-skills-as-an-engineer) 7 | to learn the latest techniques. But it's harder to know what problems 8 | are worth working on. 9 | 10 | This repository contains a living collection of important and fun 11 | problems to help new people enter the field, and for enthusiastic 12 | practitioners to hone their skills. Many will require inventing new 13 | ideas. 14 | 15 | Also check out our new list: [Requests for Research 2.0](https://blog.openai.com/requests-for-research-2/) 16 | 17 | ## If you've solved a problem 18 | 19 | Please write up the problem in a Gist or paper, and open a pull 20 | request linking it in a "solutions" section for the relevant 21 | problem. (Alternatively, let us know about it in 22 | [community chat](https://gitter.im/openai/research).) 23 | 24 | The best solutions will contain both code and an explanation of your 25 | methodology. Please also feel free to report things you tried that 26 | didn't work, or anything else helpful to someone trying to learn how 27 | to do their own deep learning research. 28 | 29 | We'll accept multiple solutions to each problem, so long as each 30 | solution is materially different. 31 | 32 | ## This repository 33 | 34 | This respository hosts the source for the 35 | [requests for research](https://openai.com/requests-for-research). Feel 36 | free to open a pull request. Especially encouraged are: 37 | 38 | - Suggestions for new problems 39 | - Suggestions for improvements to existing problems 40 | - Links to your solution. 41 | 42 | ## Running this repo locally 43 | 44 | Install or upgrade `jekyll` via `gem install jekyll`. You can run this 45 | repo locally via: 46 | 47 | ``` 48 | jekyll serve -w 49 | ``` 50 | 51 | Your content will then be available at `http://127.0.0.1:4000/`. It 52 | won't be styled, but that should be enough to get started. 53 | -------------------------------------------------------------------------------- /_config.yml: -------------------------------------------------------------------------------- 1 | title: 'OpenAI' 2 | description: 'OpenAI is a non-profit artificial intelligence research company.' 3 | 4 | collections: 5 | - requests_for_research 6 | -------------------------------------------------------------------------------- /_requests_for_research/better-sample-efficiency-for-trpo.html: -------------------------------------------------------------------------------- 1 | --- 2 | title: 'Better sample efficiency for TRPO' 3 | summary: '' 4 | difficulty: 3 # out of 3 5 | --- 6 | 7 |

Trust Region Policy Optimization (TRPO) is a scalable implementation of 8 | second order policy gradient algorithm that is highly effective on both continuous and discrete control problems. One of the strengths of TRPO is that it is relatively easy to set its hyperparameters, as a hyperparameter setting that performs well on one task tends to perform well on many other tasks. But despite its significant advantages, the TRPO algorithm could be more data efficient. 9 |

10 | 11 |

The problem is to 12 | modify a 13 | good TRPO implementation so that it would converge on all of 14 | Gym's MuJoCo 15 | environments using 3x less experience, without a degradation in 16 | final average reward. Ideally, the new code should use the same 17 | setting of the hyperparameters for every problem.

18 | 19 |

This will be an impressive achievement, and the result will likely 20 | be scientifically significant.

21 | 22 |

When designing the code, you may find the following ideas useful: 23 |

Dynamically adjust the size of the mini-batch.
Dynamically adjust the radius of the trust region.
Use past samples with correctly-set importance weights.

28 |

29 | 30 |

31 |

Notes

32 | 33 |

This problem is very hard, as getting an improvement of this magnitude is likely to require new ideas.

34 | -------------------------------------------------------------------------------- /_requests_for_research/cartpole.html: -------------------------------------------------------------------------------- 1 | --- 2 | title: "Cartpole: for newcomers to RL" 3 | summary: '' 4 | difficulty: 1 # out of 3 5 | --- 6 | 7 |

The Cartpole environment is one of the simplest MDPs. It is extremely low dimensional, with a four-dimensional observation space and only two actions. The goal of this exercise is to implement several RL algorithm in order to get practical experience with such methods.

8 | 9 |

The small size and simplicity of this environment makes it is possible to run very quick experiments, which is essential when learning the basics.

10 | 11 |

Start with a simple linear model (that has only four parameters), and use the sign of the weighted sum to choose between the two actions. 12 |

The random guessing algorithm: generate 10,000 random configurations of the model's parameters, and pick the one that achieves the best cumulative reward. It is important to choose the distribution over the parameters correctly.
The hill-climbing algorithm: Start with a random setting of the parameters, add a small amount of noise to the parameters, and evaluate the new parameter configuration. If it performs better than the old configuration, discard the old configuration and accept the new one. Repeat this process for some number of iterations. How long does it take to achieve perfect performance?
Policy gradient algorithm: here, instead of choosing the action as a deterministic function of the sign of the weighted sum, make it so that action is chosen randomly, but where the distribution over actions (of which there are two) depends on the numerical output of the inner product. Policy gradient prescribes a principled parameter update rule [1, 2]. Your goal is to implement this algorithm for the simple linear model, and see how long it takes to converge.

20 | 21 |

What happens to the above algorithm when the policy is a neural network with tens of thousands of parameters?

22 | 23 |

24 | 25 |

Notes

26 | 27 |

This is a simple task that is meant to help newcomers gain practical experience with implementing simple RL algorithms. 28 |

29 | 30 |

Solutions

31 | 32 |

Results and some intuition behind the algorithms at this post , and here is the code used. 33 | -------------------------------------------------------------------------------- /_requests_for_research/description2code.html: -------------------------------------------------------------------------------- 1 | --- 2 | title: 'Description2Code' 3 | summary: "Given some brief text describing a short program, generate the program's source code." 4 | difficulty: 3 # out of 3 5 | --- 6 | 7 | 8 |

The (extremely) ambitious goal of this request is to solve the problem 9 | of turning descriptions into code. It is outside the reach of 10 | current machine learning algorithms. 11 | However, ethancaballero 12 | has 13 | collected 5000 14 | input-output examples of programming challenges. It can be 15 | interesting to play with this small dataset, to see whether 16 | anything interesting can be achieved with an application of 17 | standard supervised learning techniques.

18 | 19 | 20 | 21 | -------------------------------------------------------------------------------- /_requests_for_research/difference-of-value-functions.html: -------------------------------------------------------------------------------- 1 | --- 2 | title: Difference of value functions 3 | summary: '' 4 | difficulty: 2 # out of 3 5 | --- 6 | 7 |

Bertsekas wrote an interesting paper arguing why it might be better to learn functions that measure the difference in value between states, rather than the value of states. Implement this algorithm with neural networks and apply it to challenging Gym environments.

8 | 9 |

10 |

Notes

11 | 12 |

This idea may turn out to not be fruitful, and getting a good result may prove to be impossible.

13 | -------------------------------------------------------------------------------- /_requests_for_research/funnybot.html: -------------------------------------------------------------------------------- 1 | --- 2 | title: Train a language model on a jokes corpus 3 | summary: '' 4 | difficulty: 1 # out of 3 5 | --- 6 | 7 |

Train a character-level language model on a corpus of jokes.

8 | 9 | 10 |

To do so, use this 200k English jokes dataset or build your own (the larger the better; if you do so, please submit a pull request and we'll link to your dataset), implement a character-level LSTM (or use an existing implementation), train it on this dataset, and draw samples from it. If successful, the output from the LSTM should be actually funny. 11 | 12 |

13 | 14 | -------------------------------------------------------------------------------- /_requests_for_research/im2latex.html: -------------------------------------------------------------------------------- 1 | --- 2 | title: 'Im2Latex' 3 | summary: '' 4 | difficulty: 1 # out of 3 5 | --- 6 | 7 | 8 |

Sequence-to-sequence models with attention have been enormously successful. They made it possible for neural networks to reach 9 | new levels of state of the art in machine translation, speech recognition, and syntactic parsing. Thanks to this work, neural network can now consume inputs of arbitrary shapes, and output sequences of variable length, without much effort on the practitioner's side.

10 | 11 | 12 |

Implement an attention model that takes an image of a PDF math formula, and outputs the characters of the LaTeX source that generates the formula.

13 | 14 |

15 | 16 |

Getting Started

17 | 18 |

For a quick start, download a prebuilt dataset or use these tools to build your own dataset. Alternatively, you can proceed manually with the following steps:

19 |

Download a large number papers from arXiv. There is a collection of 29,000 arXiv papers that you could get started with. It is likely that this set of 29,000 papers may contain several hundred thousand formulas, which is more than enough for getting started. As the bandwidth of arXiv is limited, it is important to be mindful of their constraints and to not write crawlers to download all the papers of arxiv.
Use a heuristic to find all the LaTeX formulas in the LaTeX source. It can be done by looking for the text that lies between \begin{equation} and \end{equation}. Here is a list of some of the places where equations can appear in latex files. Additional examples can be found here. It is likely that even a simple heuristic for extracting latex formulas should produce in excess of 100,000 equations; if not, keep refining the heuristic.
Compile images of all the formulas. To keep track of the correspondence between the latex formulas and their images, it is easiest to place exactly one formula on each page. Then, when processing the latex file, it is easy to keep track of the pages. Be sure to not render formulas so large that they exceed an entire page. Also, be sure to render the formulas in several fonts.
Train a visual attention sequence-to-sequence model (as in the Show, Attend, and Tell paper, or perhaps a different variant of visual attention) that would take an image of a formula as input, and output the latex source of the formula, one character at a time. A Theano implementation of the Show, Attend, and Tell paper can help you get started. If you wish to implement your model from scratch, TensorFlow can be a good starting point.
It takes some effort to correctly implement a sequence-to-sequence model with attention. To debug your model, we recommend that you start with a toy synthetic OCR problem, where the inputs are long images that are obtained by concatenating sequences of images of MNIST digits, and the labels should be a sequence of their classifications. While this problem can be solved without an attention model, it is useful as a sanity check, to ensure that the implementation is not badly broken.
We recommend trying the Adam optimizer.

27 | 28 |

29 | 30 |

Notes

31 | 32 |

A success here would be a very cool result and could be used to build a useful online tool.

33 | 34 |

While this is a very non-trivial project, we've marked it with a one-star difficulty rating because we know it's solvable using current methods. It is still very challenging to really do it, as it requires getting several ML components together correctly.

35 | 36 |

Solutions

37 | 38 |

Results, data set, code, and a write-up are available at http://lstm.seas.harvard.edu/latex/. The model is trained on the above data sets and uses an extension of the Show, Attend and Tell paper combined with a multi-row LSTM encoder. Code is written in Torch (based on the seq2seq-attn system), and the model is optimized using SGD. Additional experiments are run using the model to generate HTML from small webpages. 39 | -------------------------------------------------------------------------------- /_requests_for_research/improved-q-learning-with-continuous-actions.html: -------------------------------------------------------------------------------- 1 | --- 2 | title: 'Improved Q-learning with continuous actions' 3 | summary: '' 4 | difficulty: 2 # out of 3 5 | --- 6 | 7 |

8 | Q-learning 9 | is one of the oldest and general reinforcement learning (RL) 10 | algorithms. It works by estimating the long term future expected 11 | return for each state-action pair. Essentially, the goal of the Q-learning 12 | algorithm is fold the long term outcome of each state-action pair 13 | into a single scalar that tells us how good the state-action pair 14 | combination is; then, we could maximize our reward by picking the 15 | action with the greatest value of the Q-function. The Q-learning 16 | algorithm has been the basis of 17 | the DQN 18 | algorithm that demonstrated that the combination of RL 19 | and deep learning is a fruitful one. 20 |

21 | 22 |

Your goal is to create a robust Q-learning implementation that can solve 23 | all Gym environments with 24 | continuous action spaces without changing hyperparameters.

25 | 26 |

You may want to use 27 | the Normalized Advantage 28 | Function (NAF) model as a starting point. It is especially 29 | interesting to experiment with variants of the NAF model: for 30 | example, try it with a diagonal covariance. It can be also interesting to explore 31 | an advantage function that uses the maximum of several quadratics, 32 | which is a convenient function because their argmax is easy to 33 | compute. 34 |

35 | 36 | 37 |

38 | 39 |

Notes

40 |

41 | This project is mainly concerned with reimplementing an existing 42 | algorithm. However, there is significant value in obtaining a very 43 | robust implementation, and there is a decent chance that new ideas 44 | will end up being required to get it working reliably, across 45 | multiple tasks. 46 |

47 | -------------------------------------------------------------------------------- /_requests_for_research/infinite-symbolic-generalization.html: -------------------------------------------------------------------------------- 1 | --- 2 | title: 'Reduced-Information Program Learning' 3 | summary: '' 4 | difficulty: 2 # out of 3 5 | --- 6 | 7 |

A difficult machine learning task is that of program, or algorithm, learning. Examples of simple algorithmic tasks are RepeatCopy and ReversedAddition. Theoretically, recurrent neural networks (RNNs) are Turing-complete (if they have large enough memory and enough timesteps between reading the input and emitting the output), which means that they can model any computable function. In practice, however, it has been difficult to learn algorithms that output the intended answer on every conceivable input (as opposed to outputting the correct answer on the training data distribution only).

8 | 9 |

The Neural Programmer-Interpreter (NPI) is an example of a program-learning model which uses execution traces as its supervised signal. This is opposed to the notion of program induction, where programs must be inferred from input-output pairs. Using execution traces as a supervisory signal, the NPI learn to solve extremely hard problems and generalize to inputs of greater length. However, most interesting problems do not have execution traces available: if we knew the detailed execution traces, we would probably know how to solve the problem as well.

10 | 11 |

The challenge is thus to achieve similar results with partial execution traces. A partial execution trace provides a compact high level description of the way in which a particular instance was solved. Given an input, a partial execution trace provides a rough description of how the solution was computed, without going into all the details. Other weak supervision can provided -- such as the input-output pairs. 12 | 13 | The challenge in this problem is to design a model that is actually capable of learning from partial execution traces, and to show that it can learn algorithmic tasks (such as bubble sort, quick sort, multiplication, various string operations, etc) quickly. It is desirable to develop a model that can solve these problems using the least specific execution traces possible. 14 | 15 | 16 | 17 | 18 |

19 | 20 | 21 | 22 | -------------------------------------------------------------------------------- /_requests_for_research/inverse-draw.html: -------------------------------------------------------------------------------- 1 | --- 2 | title: 'The Inverse DRAW model' 3 | summary: '' 4 | difficulty: 2 # out of 3 5 | --- 6 | 7 |

Investigate an “Inverse DRAW” model.

8 | 9 |

10 | The DRAW model is a generative 11 | model of natural images that operates by making a large number of small 12 | contributions to an additive canvas using an attention model. The attention 13 | model used by the DRAW model identifies a small area in the image and "writes" to it. 14 |

15 | 16 |

17 | In the inverse DRAW model, there is a stochastic hidden variable and 18 | an attention model that reads from these hidden variables. The outputs 19 | of the attention model are provided to an LSTM that produces the 20 | observation one dimension (or one group of dimensions) at a time. 21 | Thus, while the DRAW model uses attention to decide where to write on the output canvas, 22 | the inverse DRAW uses attention to choose the latent variable to be used 23 | at a given timestep. The Inverse DRAW model can be seen as a 24 | Neural Turing Machine generative 25 | model that emits one dimension at a time, where the memory is a read-only latent variable. 26 |

27 | 28 | 29 |

The Inverse DRAW model is an interesting concept to explore 30 | because the dimensionality of the hidden variable is decoupled from 31 | the length of the input. In more detail, the Inverse DRAW model is 32 | a variational 33 | autoencoder, whose p-model emits the observation one dimension at 34 | a time, using attention to choose the appropriate latent variable for 35 | each visible dimension. There is a fair bit of choice in the 36 | architecture of the approximate posterior. A natural choice is to use 37 | the same architecture for the posterior, where the observation will be 38 | playing the role of the latent variables. 39 |

40 | 41 |

A useful property of the Inverse DRAW model is that its latent variables 42 | may operate at a rate that is different from the observation. This 43 | is the case because each dimension of the observation gets assigned to one 44 | hidden state. If this model were to successfully be made deep, we would get 45 | a hierarchy of representation, where each representation is operating at a 46 | variable rate, which is trained to be as well-suited as possible for the 47 | current dataset. 48 |

49 | 50 |

It would be interesting to apply this model to a text dataset, and to visualize 51 | the latent variables, as well as the precise way in which the model assigns 52 | words to latent variables. 53 |

54 |

55 | 56 |

Notes

57 | 58 |

It is a hard project as it is not clear that a models like this 59 | can be made to work with current techniques. However, it makes 60 | success all the more impressive.

61 | 62 |

The inverse DRAW model may have a cost function that's very difficult to optimize, so expect a struggle.

63 | 64 |

Solutions

65 | 66 | The code and an associated paper for a model implementing a version of the 67 | "Inverse DRAW" model is available here. 68 | -------------------------------------------------------------------------------- /_requests_for_research/multiobjective-rl.html: -------------------------------------------------------------------------------- 1 | --- 2 | title: Multiobjective RL 3 | summary: '' 4 | difficulty: 2 # out of 3 5 | --- 6 | 7 |

In reinforcement learning, we often have several rewards that we 8 | care about. For example, in robotic locomotion, we want to maximize 9 | forward velocity but minimize joint torque and impact with the 10 | ground.

11 | 12 |

The standard practice is to use a reward function that is a 13 | weighted sum of these terms. However, it is often difficult to balance 14 | the factors to achieve satisfactory performance on all rewards. 15 |

16 | 17 |

Filter methods are algorithms from multi-objective 18 | optimization that seek to generate a sequence of points, so that each 19 | one is not strictly dominated by a previous one (see Nocedal & 20 | Wright, chapter 15.4).

21 | 22 |

Develop a filter method for RL that jointly optimizes a collection 23 | of reward functions, and test it on the 24 | Gym MuJoCo 25 | environments. Most of these have summed rewards; you would need to 26 | inspect the code of the environments to find the individual 27 | components.

28 | 29 |

30 | 31 |

Related work

32 | 33 | There exists some prior work on multiobjective optimization in an RL context. See the following review paper by Roijers et al. 34 | 35 |

Notes

36 | 37 |

Filter methods have not been applied to RL much, so there is a lot of uncertainty around the difficulty of the problem.

38 | -------------------------------------------------------------------------------- /_requests_for_research/multitask-rl-with-continuous-actions.html: -------------------------------------------------------------------------------- 1 | --- 2 | title: 'Multitask RL with continuous actions.' 3 | summary: '' 4 | difficulty: 2 # out of 3 5 | --- 6 | 7 | 8 |

At present, most machine learning algorithms are trained to solve one task 9 | and one task only. But we do not necessarily train models on only one task 10 | at a time because we believe that it is the best approach in the long term; 11 | on the contrary, while we would like to use multitask learning in as many 12 | problems as possible, the multitask learning algorithms are not yet at a stage 13 | where they provide a robust and a sizeable improvement across a wide range of domains. 14 |

15 | 16 |

This sort of multitask learning should be particularly important in reinforcement 17 | learning settings, since in the long run, experience will be very expensive 18 | relative to computation and possibly supervised data. For this reason, it is 19 | worthwhile to investigate the feasibility of multitask learning using 20 | the RL algorithms that have been developed so far. 21 |

22 | 23 | 24 |

Thus the goal is to train a single neural network that can simultaneously solve a collection of MuJoCo 25 | environments in Gym. The current enviroments are dissimilar enough that it is unlikely that information can be shared between them. Therefore, your job is to create a set of similar environments that will serve as a good testbed for multi-task learning. Some possibilities include (1) bipedally walking with different limb dimensions and masses, (2) reaching with octopus arms that have different numbers of links, (3) using the same robot model for walking, jumping, and standing.

26 | 27 |

At the end of learning, the trained neural network should be told 28 | (via an additional input) which task it's running on, and 29 | achieve high cumulative reward on this task. The goal of this problem 30 | is to determine whether there is any benefit whatsoever to training a 31 | single neural network on multiple environments versus a single 32 | one, where we measure benefit via training speed.

33 | 34 |

We already know that the multitask learning on Atari has been difficult 35 | (see the relevant papers). But will 36 | multitask learning work better on MuJoCo environments? The goal is to 37 | find out.

38 | 39 |

The most interesting experiment is to train a multitask net of this kind 40 | on all but one MuJoCo environment, and then see if the resulting net 41 | can be trained more rapidly on a task that it hasn't been trained on. 42 | In other words, we hope that this kind of multitask learning can 43 | accelerate training of new tasks. If successful, the results 44 | can be significant.

45 | 46 |

47 | 48 |

Notes

49 | 50 |

It is a reasonably risky project, since there is a chance that 51 | this kind of transfer will be as difficult as it has been for 52 | Atari.

53 | -------------------------------------------------------------------------------- /_requests_for_research/natural-q-learning.html: -------------------------------------------------------------------------------- 1 | --- 2 | title: 'Natural Q-learning' 3 | summary: '' 4 | difficulty: 2 # out of 3 5 | --- 6 | 7 |

Implement and test a natural version of Q-learning, and compare it to regular Q-learning.

8 | 9 |

Natural Gradient is a promising idea that has been explored in a significant number of papers and settings. Despite its appeal, modern approaches to natural gradient have not been applied to Q-learning with nonlinear function approximation. 10 |

11 | 12 |

The intuition behind natural gradient is the following: we can identify a neural network with its parameters, and use the backpropagation algorithm to slowly change the parameters to minimize the cost function. But we can also think of a neural network as of a high-dimensional manifold in the infinite-dimensional space of all possible functions, and we can, at least conceptually, run gradient descent in function space, subject to the constraint that we stay on the neural network manifold. This approach has the advantage that it does not depend on the specific parameterization used by the neural network; for example, it is known that tanh units and sigmoid units are precisely equivalent in the family of neural networks that they can represent, but their gradients are different. Thus, the choice of sigmoid versus tanh will affect the backpropagation algorithm, but it will not affect the idealized natural gradient, since natural gradient depends entirely on the neural network manifold, and we already established that the neural network manifold is unaffected by the choice of sigmoid versus tanh. If we formalize the notion of natural gradient, we get that the natural gradient direction is obtained by inverting the regular gradient by the Fisher information matrix. The result is still a challenging problem, but it can be addressed in a variety of ways, some of which are discussed in the papers linked above. But the relevant fact about natural gradient is that its behavior is much more stable and benign in a variety of settings (for example, natural gradient is relatively unaffected by the order of the data in the training set, and is highly amenable to data parallelism), which suggests that natural gradient could improve the stability of the Q-learning algorithm as well. 13 |

14 | 15 |

In this project, your goal is to figure out how to meaningfully apply natural gradient to Q-learning, and to compare the results to a good implementation of Q-learning. Thus, the first step of this project is to implement Q-learning. 16 | 17 | We recommend either staying with discrete domains (such as Atari), or continuous domains, and to use methods similar to Normalized Advantage Function (NAF). The continuous domains are easier to work with because they are of lower dimensionality and are therefore simpler, but NAF can be harder to implement than standard Q-learning.

18 | 19 |

It would be especially interesting if Natural Q-learning were 20 | capable of solving the RAM-Atari tasks.

21 | 22 |

23 | 24 |

Notes

25 | 26 |

This project isn't guaranteed to be solvable: it could be that Q-learning's occasional instability and failure has little to do with whether it is natural or not.

27 | 28 |

Solutions

29 | 30 |

NGDQN model, results, and paper trained on a discrete environment available here.

31 | -------------------------------------------------------------------------------- /_requests_for_research/parallel-trpo.html: -------------------------------------------------------------------------------- 1 | --- 2 | title: Parallel TRPO 3 | summary: '' 4 | difficulty: 2 # out of 3 5 | --- 6 | 7 |

8 | As it is always desirable to train larger models on harder domains, one important area of research is parallelization. Parallelization 9 | has played an important role in deep learning, and has been especially successful in reinforcement learning. The successful development of algorithms that parallelize well will make it possible to train larger models faster, which will advance the field. 10 |

11 | 12 |

13 | The goal of this project is to implement the Trust Region Policy Optimization (TRPO) algorithm so that it would use multiple computers to achieve 15x lower wall-clock 14 | time than joschu's single-threaded implementation on the MuJoCo or Atari Gym environments. Given that TRPO is a highly stable algorithm that is extremely easy to use, a well-tuned parallel implementation could have a lot of practical significance. 15 |

16 | 17 |

You may worry that in order to solve this problem, you would need access to a large number of computers. However, it is not so, as it is straightforward to simulate a set of parallel computers using a single core. 18 | 19 |

Make sure your code remains generic and readable. 20 |

21 | 22 | 23 | 24 |

25 | 26 |

Notes

27 | 28 |

It is known that RL algorithms can be parallelized well, so we expect it to be possible to improve upon the basic implementation. What is less obvious is whether it is possible to get 15x speedup using, say, only 20x more nodes.

29 | 30 |

Solutions

31 | 32 |

Preliminary paper describing TRPO with parallel actors here , with the implementation avaiable at this repo. Current results are a 3x speedup with when using 4 cores. 33 | -------------------------------------------------------------------------------- /_requests_for_research/q-learning-on-the-ram-variant-of-atari.html: -------------------------------------------------------------------------------- 1 | --- 2 | title: 'Q-learning on the RAM variant of Atari' 3 | summary: '' 4 | difficulty: 2 # out of 3 5 | --- 6 | 7 |

The Q-learning algorithm had a lot of success on learning to play Atari games when the inputs 8 | to the model are pixels. Atari games are designed to run on a computer that has a very small amount of RAM, so it can be interesting to try to learn to play Atari games when the input to the neural network is the RAM state of the Atari emulator. Surprisingly, getting Q-learning to work well when the inputs are RAM states has been unexpectedly challenging.

9 | 10 |

Thus, your goal is to develop a Q-learning implementation that can solve many Atari games when the input to the neural network is the RAM state, using the same setting of hyperparameters on all tasks. In your experiments, use the Gym Atari environments that are presented in the RAM way, where the inputs are the complete RAM state of the Atari computer.

11 | 12 |

The hope here is that in order to succeed, you'd need to invent techniques for Q-learning that are generally applicable, which will be useful.

13 | 14 |

15 | 16 |

Notes

17 | 18 |

This project might not be solvable. It would be surprising if it were to turn out that Q-learning would never succeed on the RAM variants of Atari, but there is some chance that it will turn out to be challenging.

19 | 20 |

21 | 22 |

Solutions

23 | 24 |

The preliminary results can be read in the paper and here are the instructions to run the code. The work has been accepted at Computer Games Workshop and will be presented on 9 July 2016 during IJCAI conference in New York. Feel free to pass by, if you're there!

25 | -------------------------------------------------------------------------------- /index.html: -------------------------------------------------------------------------------- 1 | --- 2 | title: 'OpenAI Requests for Research' 3 | --- 4 | 5 | {% assign items = site.requests_for_research | sort: 'difficulty' %} 6 | {% for request in items %} 7 |

8 | 9 |

10 | {{ request.content }} 11 |

12 | {% endfor %} 13 | --------------------------------------------------------------------------------

Notes

Notes

Solutions

Notes

Getting Started

Notes

Solutions

Notes

Notes

Solutions

Related work

Notes

Notes

Notes

Solutions

Notes

Solutions

Notes

Solutions

{{ request.title }}