We explore building generative neural network models of popular reinforcement learning environments. Our *world model* can be trained quickly in an unsupervised manner to learn a compressed spatial and temporal representation of the environment. By using features extracted from the world model as inputs to an agent, we can train a very compact and simple policy that can solve the required task. We can even train our agent entirely inside of its own hallucinated dream generated by its world model, and transfer this policy back into the actual environment.

Humans develop a mental model of the world based on what they are able to perceive with their limited senses. The decisions and actions we make are based on this internal model. Jay Wright Forrester, the father of system dynamics, defined a mental model as:

“*The image of the world around us, which we carry in our head, is just a model. Nobody in his head imagines all the world, government or country. He has only selected concepts, and relationships between them, and uses those to represent the real system.*”

To handle the vast amount of information that flows through our daily lives, our brain learns an abstract representation of both spatial and temporal aspects of this information. We are able to observe a scene and remember an abstract description thereof.

One way of understanding the predictive model in our brains is that it might not be about just predicting the future in general, but predicting future sensory data given current motor actions.

Take baseball for example

In many reinforcement learning (RL)

Large RNNs are highly expressive models that can learn rich spatial and temporal representations of data. However, many *model-free* RL methods in the literature often only use small neural networks with few parameters. The RL algorithm is often bottlenecked by the *credit assignment problem*^{1}

Ideally, we would like to be able to efficiently train large RNN-based agents. The backpropagation algorithm ^{2}

Although there is a large body of research relating to *model-based* reinforcement learning, this article is not meant to be a review

In this article, we present a simplified framework that we can use to experimentally demonstrate some of the key concepts from these papers, and also suggest further insights to effectively apply these ideas to various RL environments. We use similar terminology and notation as *Learning to Think*

We present a simple model inspired by our own cognitive system. In this model, our agent has a visual sensory component that compresses what it sees into a small representative code. It also has a memory component that makes predictions about future codes based on historical information. Finally, our agent has a decision-making component that decides what actions to take based only on the representations created by its vision and memory components.

Our agent consists of three components that work closely together: Vision (V), Memory (M), and Controller (C).

The environment provides our agent with a high dimensional input observation at each time step. This input is usually a 2D image frame in a video sequence. The role of the V model is to learn an abstract, compressed representation of each observed input frame.

We use a Variational Autoencoder (VAE) as our V model to compress each image frame into a small *latent vector* $z_t$. This compressed representation can be used to reconstruct the original image.

While it is the role of the V model to compress what the agent sees at each time frame, we also want to compress what happens over time. For this purpose, the role of the M model is to predict the future. The M model serves as a predictive model of the future $z$ vectors that V is expected to produce. Because many complex environments are stochastic in nature, we train our RNN to output a probability density function $p(z)$ instead of a deterministic prediction $z$.

In our approach, we approximate $p(z)$ as a mixture of Gaussian distributions, and train the RNN to output the probability distribution of the next latent vector $z_{t+1}$ given the current and past information made available to it.

More specifically, the RNN will model $P(z_{t+1} \; | \; a_t, z_t, h_t)$, where $a_t$ is the action taken at time $t$ and $h_t$ is the *hidden state* of the RNN at time $t$. During sampling, we can adjust a *temperature* parameter $\tau$ to control model uncertainty, as done in

This approach is known as a Mixture Density Network
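As a rough illustration of how a temperature parameter can act during sampling, the sketch below draws one scalar from a one-dimensional Gaussian mixture. The convention used here, dividing the log mixture weights by $\tau$ and scaling each standard deviation by $\sqrt{\tau}$, is an assumption made for illustration and is not specified in the text above. The function names are hypothetical.

```python
import math
import random


def adjust_temperature(log_pi, tau):
    """Divide the mixture log-weights by tau, then renormalize with a softmax."""
    scaled = [lp / tau for lp in log_pi]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_mdn(log_pi, mu, sigma, tau=1.0, rng=random):
    """Sample one scalar from a 1-D Gaussian mixture at temperature tau.

    Assumed convention: a higher tau flattens the mixture weights and
    widens each Gaussian component.
    """
    pi = adjust_temperature(log_pi, tau)
    # Pick a mixture component according to the adjusted weights.
    u = rng.random()
    cumulative = 0.0
    k = len(pi) - 1
    for i, p in enumerate(pi):
        cumulative += p
        if u < cumulative:
            k = i
            break
    # Draw from the chosen Gaussian, with its spread scaled by sqrt(tau).
    return mu[k] + sigma[k] * math.sqrt(tau) * rng.gauss(0.0, 1.0)
```

With a small $\tau$ the adjusted weights concentrate on the most likely component, so sampling becomes close to deterministic; with a large $\tau$ the rarer components are chosen more often.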

The Controller (C) model is responsible for determining the course of actions to take in order to maximize the expected cumulative reward of the agent during a rollout in the environment. In our experiments, we deliberately make C as simple and small as possible, and train it separately from V and M, so that most of our agent’s complexity resides in the world model (V and M).

C is a simple single layer linear model that maps $z_t$ and $h_t$ directly to action $a_t$ at each time step:

$a_t = W_c \; [z_t \; h_t]\; + b_c$

In this linear model, $W_c$ and $b_c$ are the weight matrix and bias vector that maps the concatenated input vector $[z_t \; h_t]$ to the output action vector $a_t$.^{3}
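The linear mapping above can be written out in a few lines. This is a minimal sketch using plain Python lists; the function name `controller_action` is hypothetical, and the weights would normally come from the optimizer rather than being set by hand.

```python
def controller_action(W_c, b_c, z, h):
    """Compute a_t = W_c [z_t h_t] + b_c for one time step.

    W_c is a list of rows (one row per action dimension), b_c is a list
    of biases, and z and h are the latent vector and RNN hidden state.
    """
    x = list(z) + list(h)  # concatenated input vector [z_t h_t]
    return [
        sum(w * xi for w, xi in zip(row, x)) + b
        for row, b in zip(W_c, b_c)
    ]
```

For example, with a two-dimensional $z_t$, a one-dimensional $h_t$, and two action outputs, `W_c` has two rows of three weights each.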

The following flow diagram illustrates how V, M, and C interact with the environment:

Below is the pseudocode for how our agent model is used in a rollout. Running this function on a given `controller` C will return the cumulative reward during a rollout in the environment.

```
def rollout(controller):
    """Run one episode and return its cumulative reward.

    env, rnn, and vae are global variables.
    """
    obs = env.reset()
    h = rnn.initial_state()
    done = False
    cumulative_reward = 0
    while not done:
        z = vae.encode(obs)              # V: compress the frame into z_t
        a = controller.action([z, h])    # C: choose an action from [z_t, h_t]
        obs, reward, done = env.step(a)
        cumulative_reward += reward
        h = rnn.forward([a, z, h])       # M: update the hidden state
    return cumulative_reward
```

This minimal design for C also offers important practical benefits. Advances in deep learning provided us with the tools to train large, sophisticated models efficiently, provided we can define a well-behaved, differentiable loss function. Our V and M models are designed to be trained efficiently with the backpropagation algorithm using modern GPU accelerators, so we would like most of the model’s complexity, and model parameters to reside in V and M. The number of parameters of C, a linear model, is minimal in comparison. This choice allows us to explore more unconventional ways to train C — for example, even using evolution strategies (ES)

To optimize the parameters of C, we chose the Covariance-Matrix Adaptation Evolution Strategy (CMA-ES)
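To give a feel for how an evolution strategy optimizes a small parameter vector using only a scalar score, here is a deliberately simplified sketch. It is not CMA-ES: it keeps a single isotropic Gaussian, averages the best candidates, and shrinks the step size on a fixed schedule, with no covariance adaptation. The function names, hyperparameters, and toy fitness function are all assumptions chosen for illustration.

```python
import random


def simple_es(fitness, num_params, pop_size=32, elite_size=8,
              sigma=0.5, sigma_decay=0.97, generations=100, seed=0):
    """Elite-averaging Gaussian ES (a simplified stand-in, not CMA-ES)."""
    rng = random.Random(seed)
    mean = [0.0] * num_params
    for _ in range(generations):
        # Sample candidate solutions around the current mean.
        population = [
            [m + sigma * rng.gauss(0.0, 1.0) for m in mean]
            for _ in range(pop_size)
        ]
        # Rank by fitness (higher is better) and keep the elites.
        population.sort(key=fitness, reverse=True)
        elites = population[:elite_size]
        # Move the mean to the average of the elites.
        mean = [sum(e[i] for e in elites) / elite_size
                for i in range(num_params)]
        sigma *= sigma_decay  # fixed step-size schedule
    return mean


def toy_fitness(params):
    """Toy objective with its maximum at (3, -2)."""
    target = [3.0, -2.0]
    return -sum((p - t) ** 2 for p, t in zip(params, target))


best = simple_es(toy_fitness, num_params=2)
```

In the setting described here, the fitness of a candidate would instead be the cumulative reward returned by `rollout` for a controller built from that parameter vector.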

For more specific information about the models, training procedures, and environments used in our experiments, please refer to the Appendix.

A predictive world model can help us extract useful representations of space and time. By using these features as inputs of a controller, we can train a compact and minimal controller to perform a continuous control task, such as learning to drive from pixel inputs for a top-down car racing environment^{4}

In this environment, the tracks are randomly generated for each trial, and our agent is rewarded for visiting as many tiles as possible in the least amount of time. The agent controls three continuous actions: steering left/right, acceleration, and brake.

To train our V model, we first collect a dataset of 10,000 random rollouts in the environment. An agent acting randomly explores the environment multiple times, and we record the random actions $a_t$ taken and the resulting observations from the environment.^{5}

We can now use our trained V model to pre-process each frame at time $t$ into $z_t$ to train our M model. Using this pre-processed data, along with the recorded random actions $a_t$ taken, our MDN-RNN can now be trained to model $P(z_{t+1} \; | \; a_t, z_t, h_t)$ as a mixture of Gaussians.^{6}
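The data preparation described above can be sketched as follows: each recorded rollout, once encoded by V, becomes a sequence of inputs $(z_t, a_t)$ paired with targets $z_{t+1}$. The helper name `make_training_pairs` and the list-based layout are assumptions for illustration, not the format used in the experiments.

```python
def make_training_pairs(z_seq, a_seq):
    """Build (input, target) pairs for next-latent prediction.

    z_seq[t] is the latent vector at time t and a_seq[t] is the action
    taken at time t. The input at step t is [z_t, a_t] concatenated,
    and the target is z_{t+1}.
    """
    if len(z_seq) != len(a_seq):
        raise ValueError("z_seq and a_seq must have the same length")
    inputs = [list(z) + list(a) for z, a in zip(z_seq[:-1], a_seq[:-1])]
    targets = [list(z_next) for z_next in z_seq[1:]]
    return inputs, targets
```

A rollout of $T$ steps yields $T - 1$ pairs, because the final latent vector has no successor to predict.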

In this experiment, the world model (V and M) has no knowledge about the actual reward signals in the environment. Its task is simply to compress and predict the sequence of image frames observed. Only the Controller (C) Model has access to the reward information from the environment. Since there are a mere 867 parameters inside the linear controller model, evolutionary algorithms such as CMA-ES are well suited for this optimization task.

The figure below compares the actual observation given to the agent and the observation captured by the world model. We can use the VAE to reconstruct each frame using $z_t$ at each time step to visualize the quality of the information the agent actually sees during a rollout:

To summarize the Car Racing experiment, below are the steps taken:

- Collect 10,000 rollouts from a random policy.
- Train VAE (V) to encode frames to latent vector $z \in \mathcal{R}^{32}$.
- Train MDN-RNN (M) to model $P(z_{t+1} \; | \; a_t, z_t, h_t)$.
- Define Controller (C) as $a_t = W_c \; [z_t \; h_t]\; + \; b_c$.
- Use CMA-ES to solve for a $W_c$ and $b_c$ that maximizes the expected cumulative reward.

Model | Parameter Count |
---|---|
VAE | 4,348,547 |
MDN-RNN | 422,368 |
Controller | 867 |
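The controller's parameter count can be checked with simple arithmetic. The sketch below assumes an RNN hidden state of size 256; that value is not given in this section and is used here only because, together with the 32-dimensional $z$ and the three actions, it reproduces the 867 parameters in the table.

```python
def linear_controller_param_count(z_size, h_size, action_size, use_bias=True):
    """Count the weights (and optional biases) of a_t = W_c [z_t h_t] + b_c."""
    count = (z_size + h_size) * action_size  # entries of W_c
    if use_bias:
        count += action_size                 # entries of b_c
    return count


# Assumed sizes: z in R^32, hidden state of size 256, three actions.
car_racing_params = linear_controller_param_count(32, 256, 3)
```

That is, $(32 + 256) \times 3 + 3 = 867$.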

Training an agent to drive is not a difficult task if we have a good representation of the observation. Previous works

Although the agent is still able to navigate the race track in this setting, we notice it wobbles around and misses the tracks on sharper corners. This handicapped agent achieved an average score of 632 $\pm$ 251 over 100 random trials, in line with the performance of other agents on OpenAI Gym’s leaderboard

The representation $z_t$ provided by our V model only captures a representation at a moment in time and doesn’t have much predictive power. In contrast, M is trained to do one thing, and to do it really well, which is to predict $z_{t+1}$. Since M’s prediction of $z_{t+1}$ is produced from the RNN’s hidden state $h_t$ at time $t$, this vector is a good candidate for the set of learned features we can give to our agent. Combining $z_t$ with $h_t$ gives our controller C a good representation of both the current observation, and what to expect in the future.

Indeed, we see that allowing the agent to access both $z_t$ and $h_t$ greatly improves its driving capability. The driving is more stable, and the agent seems able to attack the sharp corners effectively. Furthermore, we see that in making these fast reflexive driving decisions during a car race, the agent does not need to *plan ahead* and roll out hypothetical scenarios of the future. Since $h_t$ contains information about the probability distribution of the future, the agent can just query the RNN instinctively to guide its action decisions. Like a seasoned Formula One driver or the baseball player discussed earlier, the agent can instinctively predict when and where to navigate in the heat of the moment.

Method | Average Score over 100 Random Tracks |
---|---|
DQN | 343 $\pm$ 18 |
A3C (continuous) | 591 $\pm$ 45 |
A3C (discrete) | 652 $\pm$ 10 |
ceobillionaire’s algorithm (unpublished) | 838 $\pm$ 11 |
V model only, $z$ input | 632 $\pm$ 251 |
V model only, $z$ input with a hidden layer | 788 $\pm$ 141 |
Full World Model, $z$ and $h$ | 906 $\pm$ 21 |

Our agent was able to achieve a score of 906 $\pm$ 21 over 100 random trials, effectively solving the task and obtaining new state of the art results. Previous attempts

Since our world model is able to model the future, we are able to have it hallucinate hypothetical car racing scenarios on its own. We can ask it to produce the probability distribution of $z_{t+1}$ given the current states, *sample* a $z_{t+1}$ and use this sample as the real observation. We can put our trained C back into this hallucinated environment generated by M. The following demo shows how our world model can be used to hallucinate the car racing environment:

We have just seen that a policy learned in the real environment appears to function, at least to some extent, inside of the dream environment. This raises the question: can we train our agent to learn inside of its own dream, and transfer this policy back to the actual environment?

If our world model is sufficiently accurate for its purpose, and complete enough for the problem at hand, we should be able to substitute the actual environment with this world model. After all, our agent does not directly observe the reality, but only sees what the world model lets it see. In this experiment, we train an agent inside the hallucination generated by its world model trained to mimic a VizDoom

The agent must learn to avoid fireballs shot by monsters from the other side of the room with the sole intent of killing the agent. There are no explicit rewards in this environment, so to mimic natural selection, the cumulative reward can be defined to be the number of time steps the agent manages to stay alive in a rollout. Each rollout in the environment runs for a maximum of 2100 frames ($\sim$ 60 seconds), and the task is considered solved when the average survival time over 100 consecutive rollouts is greater than 750 frames ($\sim$ 20 seconds)
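The success criterion described above is easy to state in code. This sketch treats each rollout's reward as the number of frames survived and checks the most recent 100 rollouts against the 750-frame threshold; the function name and the choice to look at the latest window are assumptions for illustration.

```python
def is_solved(survival_frames, window=100, threshold=750):
    """Return True if the mean survival time over the last `window`
    consecutive rollouts is strictly greater than `threshold` frames."""
    if len(survival_frames) < window:
        return False  # not enough rollouts to evaluate yet
    recent = survival_frames[-window:]
    return sum(recent) / window > threshold
```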

Our VizDoom experiment is largely the same as the Car Racing task, except for a few key differences. In the Car Racing task, M is only trained to model the next $z_{t+1}$. Since we want to build a world model we can train our agent in, our M model here will also predict whether the agent dies in the next frame (as a binary event $done_{t+1}$), in addition to the next frame $z_{t+1}$.

Since the M model can predict the $done$ state in addition to the next observation, we now have all of the ingredients needed to make a full RL environment. We first build an OpenAI Gym environment interface by wrapping a `gym.Env`

In this simulation, we don’t need the V model to encode any real pixel frames during the hallucination process, so our agent trains entirely in a latent space environment. This has many advantages that will be discussed later on.

This virtual environment has an identical interface to the real environment, so after the agent learns a satisfactory policy in the virtual environment, we can easily deploy this policy back into the actual environment to see how well the policy transfers over.
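To make the idea of a latent space environment concrete, here is a minimal sketch of a virtual environment driven by a dynamics model. `ToyM` is a hypothetical stand-in with made-up dynamics, not a trained MDN-RNN, and `DreamEnv` only imitates the `reset`/`step` shape of the pseudocode shown earlier rather than any particular library interface. The reward of 1 per surviving step and the 2100-step cap follow the task description above.

```python
import random


class ToyM:
    """Hypothetical stand-in for a trained dynamics model (assumption)."""

    def __init__(self, z_size=4, seed=0):
        self.z_size = z_size
        self.rng = random.Random(seed)

    def initial_z(self):
        return [0.0] * self.z_size

    def initial_state(self):
        return [0.0] * self.z_size

    def predict(self, z, a, h):
        """Return (z_next, done, h_next) for latent z, action a, state h."""
        h_next = [0.9 * hi + 0.1 * zi for hi, zi in zip(h, z)]
        z_next = [zi + 0.1 * a + self.rng.gauss(0.0, 0.05) for zi in z]
        done = abs(z_next[0]) > 1.0  # toy rule standing in for "agent dies"
        return z_next, done, h_next


class DreamEnv:
    """Virtual environment whose observations are latent vectors from M."""

    def __init__(self, m, max_steps=2100):
        self.m = m
        self.max_steps = max_steps

    def reset(self):
        self.t = 0
        self.z = self.m.initial_z()
        self.h = self.m.initial_state()
        return self.z

    def step(self, a):
        self.z, died, self.h = self.m.predict(self.z, a, self.h)
        self.t += 1
        done = died or self.t >= self.max_steps
        return self.z, 1.0, done  # reward: one unit per step survived


def run_episode(env, policy):
    """Roll out `policy` (a function of z) and return the total reward."""
    z = env.reset()
    total, done = 0.0, False
    while not done:
        z, reward, done = env.step(policy(z))
        total += reward
    return total
```

Because no pixels are rendered, each step only involves the dynamics model, which is where the speed advantage of training in latent space comes from.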

To summarize the *Take Cover* experiment, below are the steps taken:

1. Collect 10,000 rollouts from a random policy.
2. Train VAE (V) to encode frames to latent vector $z \in \mathcal{R}^{64}$, and use V to convert the images collected from (1) into latent space representation.
3. Train MDN-RNN (M) to model $P(z_{t+1}, done_{t+1} \; | \; a_t, z_t, h_t)$.
4. Define Controller (C) as $a_t = W_c \; [z_t \; h_t]$.
5. Use CMA-ES to solve for a $W_c$ that maximizes the expected survival time inside the virtual environment.
6. Use learned policy from (5) on actual Gym environment.

Model | Parameter Count |
---|---|
VAE | 4,446,915 |
MDN-RNN | 1,678,785 |
Controller | 1,088 |

After some training, our controller learns to navigate around the dream environment and escape from deadly fireballs shot by monsters generated by the M model. Our agent achieved a *score* in this virtual environment of $\sim$ 900 frames.

The following demo shows how our agent navigates inside its own dream. The M model learns to generate monsters that shoot fireballs in the direction of the agent, while the C model discovers a policy to avoid these hallucinated fireballs. Here, the V model is only used to decode the latent vectors $z_t$ produced by M into a sequence of pixel images we can see:

Here, our RNN-based world model is trained to mimic a complete game environment designed by human programmers. By learning only from raw image data collected from random episodes, it learns how to simulate the essential aspects of the game — such as the game logic, enemy behaviour, physics, and the 3D graphics rendering.

For instance, if the agent selects the left action, the M model learns to move the agent to the left and adjust its internal representation of the game states accordingly. It also learns to block the agent from moving beyond the walls on both sides of the level if the agent attempts to move too far in either direction. Occasionally, the M model needs to keep track of multiple fireballs being shot from several different monsters and coherently move them along in their intended directions. It must also detect whether the agent has been killed by one of these fireballs.

Unlike the actual game environment, however, we note that it is possible to add extra uncertainty into the virtual environment, thus making the game more challenging in the dream environment. We can do this by increasing the temperature $\tau$ parameter during the sampling process of $z_{t+1}$, as done in

We find agents that perform well in higher temperature settings generally perform better in the normal setting. In fact, increasing $\tau$ helps prevent our controller from taking advantage of the imperfections of our world model — we will discuss this in more depth later on.

We took the agent trained in the virtual environment and tested its performance on the original VizDoom scenario. The score over 100 random consecutive trials is $\sim$ 1100 frames, far beyond the required score of 750, and also much higher than the score obtained inside the more difficult virtual environment.^{7}

We see that even though the V model is not able to capture all of the details of each frame correctly, for instance, getting the number of monsters correct, the agent is still able to use the learned policy to navigate in the real environment. The virtual environment also did not keep track of a clear number of monsters in the first place, and an agent that is able to survive the noisier and uncertain virtual nightmare environment will thrive in this clean, noiseless environment.

In our childhood, we may have encountered ways to exploit video games in ways that were not intended by the original game designer

For instance, in our initial experiments, we noticed that our agent discovered an *adversarial* policy to move around in such a way that the monsters in this virtual environment governed by the M model never shoot a single fireball in some rollouts. Even when there are signs of a fireball forming, the agent will move in a way to extinguish the fireballs magically as if it has superpowers in the environment.

Because our world model is only an approximate probabilistic model of the environment, it will occasionally generate trajectories that do not follow the laws governing the actual environment. As we saw previously, even the number of monsters on the other side of the room in the actual environment is not exactly reproduced by the world model. Like a child who learns that objects in the air usually fall to the ground, the child might also imagine unrealistic superheroes who fly across the sky. For this reason, our world model will be exploitable by the controller, even if in the actual environment such exploits do not exist.

And since we are using the M model to generate a virtual dream environment for our agent, we are also giving the controller access to all of the hidden states of M. This is essentially granting our agent access to all of the internal states and memory of the game engine of the game it is playing. Therefore our agent can efficiently explore ways to directly manipulate the hidden states of the game engine in its quest to maximize its expected cumulative reward. The weakness of this approach of learning a policy inside a learned dynamics model is that our agent can easily find an adversarial policy that can fool our dynamics model — it’ll find a policy that looks good under our dynamics model, but will fail in the actual environment, usually because it visits states where the model is wrong because they are away from the training distribution.

This weakness could be the reason that many previous works learn dynamics models of RL environments but don’t actually use those models to fully replace the actual environments.^{8}

To make it more difficult for our C model to exploit deficiencies in the M model, we chose to use the MDN-RNN as the dynamics model, which models the *distribution* of possible outcomes in the actual environment, rather than merely predicting a deterministic future. Even if the actual environment is deterministic, the MDN-RNN would in effect approximate it as a stochastic environment. This has the advantage of allowing us to train our C model inside a more stochastic version of any environment — we can simply adjust the temperature $\tau$ parameter to control the amount of randomness in the M model, hence controlling the tradeoff between realism and exploitability.

Using a mixture of Gaussians model may seem like overkill given that the latent space encoded with the VAE model is just a diagonal Gaussian. However, the discrete modes in a mixture density model are useful for environments with random discrete events, such as whether a monster decides to shoot a fireball or stay put. While a Gaussian might be sufficient to encode individual frames, an RNN with a mixture density output layer makes it easier to model the logic behind a more complicated environment with discrete random states.

For instance, if we set the temperature parameter to a very low value of $\tau=0.1$, effectively training our C model with an M model that is almost identical to a deterministic LSTM, the monsters inside this dream environment fail to shoot fireballs, no matter what the agent does, due to mode collapse. The M model is not able to *jump* to another mode in the mixture of Gaussians model where fireballs are formed and shot. Whatever policy is trained in this dream will get a perfect score of 2100 most of the time, but will obviously fail when unleashed into the harsh reality of the actual world, underperforming even a random policy.

In the following demo, we show that even low values of $\tau \sim 0.5$ make it difficult for the MDN-RNN to generate fireballs:

Note again, however, that the simpler and more robust approach in *subroutines* (parts of M’s weight matrix) for arbitrary computational purposes but can also learn to ignore M when M is useless and when ignoring M yields better performance. Nevertheless, at least in our present C—M variant, M’s predictions are essential for teaching C, more like in some of the early C—M systems

By making the temperature $\tau$ an adjustable parameter of the M model, we can see the effect of training the C model on hallucinated virtual environments with different levels of uncertainty, and see how well they transfer over to the actual environment. We experimented with varying the temperature in the virtual environment and observing the resulting average score over 100 random rollouts in the actual environment after training the agent inside the virtual environment with a given temperature:

Temperature | Score in Virtual Environment | Score in Actual Environment |
---|---|---|
0.10 | 2086 $\pm$ 140 | 193 $\pm$ 58 |
0.50 | 2060 $\pm$ 277 | 196 $\pm$ 50 |
1.00 | 1145 $\pm$ 690 | 868 $\pm$ 511 |
1.15 | 918 $\pm$ 546 | 1092 $\pm$ 556 |
1.30 | 732 $\pm$ 269 | 753 $\pm$ 139 |
Random Policy Baseline | N/A | 210 $\pm$ 108 |
Gym Leaderboard | N/A | 820 $\pm$ 58 |

We see that while increasing the temperature of the M model makes it more difficult for the C model to find adversarial policies, increasing it too much will make the virtual environment too difficult for the agent to learn anything, hence in practice it is a hyperparameter we can tune. The temperature also affects the types of strategies the agent discovers. For example, although the best score obtained is 1092 $\pm$ 556 over 100 random trials using a temperature of 1.15, increasing $\tau$ a notch to 1.30 results in a lower score but at the same time a less risky strategy with a lower variance of returns. For comparison, the best score on the OpenAI Gym leaderboard

In our experiments, the tasks are relatively simple, so a reasonable world model can be trained using a dataset collected from a random policy. But what if our environments become more sophisticated? In any difficult environment, some parts of the world are made available to the agent only after it learns how to strategically navigate through its world.

For more complicated tasks, an iterative training procedure is required. We need our agent to be able to explore its world, and constantly collect new observations so that its world model can be improved and refined over time. An iterative training procedure, adapted from *Learning To Think*

1. Initialize M, C with random model parameters.
2. Rollout to actual environment $N$ times. Agent may learn during rollouts. Save all actions $a_t$ and observations $x_t$ during rollouts to storage device.
3. Train M to model $P(x_{t+1}, r_{t+1}, a_{t+1}, done_{t+1} \; | \; x_t, a_t, h_t)$.
4. Go back to (2) if task has not been completed.
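The loop structure of the procedure above can be sketched as a small driver function. All four callables are hypothetical placeholders supplied by the caller, initialization of M and C is assumed to have happened beforehand, and the iteration cap is an added safeguard that is not part of the listed steps.

```python
def iterative_training(collect_rollouts, train_m, task_completed,
                       n_rollouts, max_iterations=10):
    """Repeat: gather data in the actual environment, retrain M, check.

    collect_rollouts(n) -> list of recorded (action, observation) data
    train_m(dataset)    -> fits the world model on everything stored so far
    task_completed()    -> True once the task is considered solved
    Returns the number of iterations used, or None if the cap was hit.
    """
    dataset = []  # stands in for the storage device
    for iteration in range(1, max_iterations + 1):
        dataset.extend(collect_rollouts(n_rollouts))  # roll out N times
        train_m(dataset)                              # retrain M
        if task_completed():                          # otherwise loop again
            return iteration
    return None
```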

We have shown that one iteration of this training loop was enough to solve simple tasks. For more difficult tasks, we need our controller in Step 2 to actively explore parts of the environment that are beneficial to improve its world model. An exciting research direction is to look at ways to incorporate artificial curiosity and intrinsic motivation

In the present approach, since M is an MDN-RNN that models a probability distribution for the next frame, if it does a poor job, then it means the agent has encountered parts of the world that it is not familiar with. Therefore we can adapt and reuse M’s training loss function to encourage curiosity. By flipping the sign of M’s loss function in the actual environment, the agent will be encouraged to explore parts of the world that it is not familiar with. The new data it collects may improve the world model.
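One possible reading of this idea is to use M's negative log-likelihood of the observed next value as an intrinsic reward, so that poorly predicted observations score highly. The sketch below does this for a one-dimensional Gaussian mixture; the function names and the small constant guarding the logarithm are assumptions for illustration.

```python
import math


def mixture_nll(x, pi, mu, sigma):
    """Negative log-likelihood of scalar x under a 1-D Gaussian mixture."""
    density = sum(
        p * math.exp(-0.5 * ((x - m) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))
        for p, m, s in zip(pi, mu, sigma)
    )
    return -math.log(density + 1e-12)  # small constant avoids log(0)


def curiosity_reward(x, pi, mu, sigma):
    """Intrinsic reward: high when the model assigned low probability to x."""
    return mixture_nll(x, pi, mu, sigma)
```

An observation close to a predicted mode yields a low reward, while one far from every mode yields a high reward, steering the agent toward unfamiliar parts of the world.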

The iterative training procedure requires the M model to not only predict the next observation $x$ and $done$, but also predict the action and reward for the next time step. This may be required for more difficult tasks. For instance, if our agent needs to learn complex motor skills to walk around its environment, the world model will learn to imitate its own C model that has already learned to walk. After difficult motor skills, such as walking, are absorbed into a large world model with lots of capacity, the smaller C model can rely on the motor skills already absorbed by the world model and focus on learning higher-level skills to navigate itself using the motor skills it had already learned.^{9}

An interesting connection to the neuroscience literature is the work on hippocampal replay that examines how the brain replays recent experiences when an animal rests or sleeps. Replaying recent experiences plays an important role in memory consolidation*Replay Comes of Age*

Iterative training could allow the C—M model to develop a natural hierarchical way to learn. Recent works about self-play in RL

There is extensive literature on learning a dynamics model, and using this model to train a policy. Many concepts first explored in the 1980s for feed-forward neural networks (FNNs)*Learning to Think*

While Gaussian processes work well with a small set of low dimensional data, their computational complexity makes them difficult to scale up to model a large history of high dimensional observations. Other recent works

In robotic control applications, the ability to learn the dynamics of a system from observing only camera-based video inputs is a challenging but important problem. Early work on RL for active vision trained an FNN to take the current image frame of a video sequence to predict the next frame

Video game environments are also popular in model-based RL research as a testbed for new ideas. Guzdial et al.

The works mentioned above use FNNs to predict the next video frame. We may want to use models that can capture longer term time dependencies. RNNs are powerful models suitable for sequence modelling*Hallucination with RNNs*

Using RNNs to develop internal models to reason about the future has been explored as early as 1990 in a paper called *Making the World Differentiable**Learning to Think*

In this work, we used evolution strategies (ES) to train our controller, as it offers many benefits. For instance, we only need to provide the optimizer with the final cumulative reward, rather than the entire history. ES is also easy to parallelize: we can launch many instances of `rollout` with different solutions to many workers and quickly compute a set of cumulative rewards in parallel. Recent works
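The parallel evaluation described above can be sketched with Python's standard library. A thread pool is used here only to keep the example portable; for CPU-bound rollouts, separate processes or machines would be the more likely choice. The function name `evaluate_population` is hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def evaluate_population(rollout_fn, solutions, max_workers=4):
    """Evaluate every candidate solution and return the cumulative
    rewards in the same order as `solutions`."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(rollout_fn, solutions))
```

Because each candidate only reports a single scalar back to the optimizer, very little data needs to move between workers.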

Before the popularity of Deep RL methods

We have demonstrated the possibility of training an agent to perform tasks entirely inside of its simulated latent space dream world. This approach offers many practical benefits. For instance, running computationally intensive game engines requires using heavy compute resources for rendering the game states into image frames, or calculating physics not immediately relevant to the game. We may not want to waste cycles training an agent in the actual environment, but instead train the agent as many times as we want inside its simulated environment. Training agents in the real world is even more expensive, so world models that are trained incrementally to simulate reality will make it easier to experiment with different approaches for training our agents.

Furthermore, we can take advantage of deep learning frameworks to accelerate our world model simulations using GPUs in a distributed environment. The benefit of implementing the world model as a fully differentiable recurrent computation graph also means that we may be able to train our agents in the dream directly using the backpropagation algorithm to fine-tune its policy to maximize an objective function

The choice of using a VAE for the V model and training it as a standalone model also has its limitations, since it may encode parts of the observations that are not relevant to a task. After all, unsupervised learning cannot, by definition, know what will be useful for the task at hand. For instance, it reproduced unimportant detailed brick tile patterns on the side walls in the Doom environment, but failed to reproduce task-relevant tiles on the road in the Car Racing environment. By training together with an M model that predicts rewards, the VAE may learn to focus on task-relevant areas of the image, but the tradeoff here is that we may not be able to reuse the VAE effectively for new tasks without retraining.

Learning task-relevant features has connections to neuroscience as well. Primary sensory neurons are released from inhibition when rewards are received, which suggests that they generally learn task-relevant features, rather than just any features, at least in adulthood

Future work might explore the use of an unsupervised segmentation layer like in

Another concern is the limited capacity of our world model. While modern storage devices can store large amounts of historical data generated using the iterative training procedure, our LSTM

Like early RNN-based C—M systems *Learning To Think**One Big Net*

*This work is meant to be a live research project and will be revised and expanded over time. This article will be the first of a series of articles exploring World Models. If you would like to discuss any issues, give feedback, or even contribute to future work, please visit the GitHub repository of this page for more information.*

We would like to thank Blake Richards, Kory Mathewson, Kyle McDonald, Kai Arulkumaran, Ankur Handa, Denny Britz, Elwin Ha and Natasha Jaques for their thoughtful feedback on this article, and for offering their valuable perspectives and insights from their areas of expertise.

The interactive demos in this article were all built using p5.js. Deploying all of these machine learning models in a web browser was made possible with deeplearn.js, a hardware-accelerated machine learning framework for the browser, developed by the People+AI Research Initiative (PAIR) team at Google. A special thanks goes to Nikhil Thorat and Daniel Smilkov for their support.

We would like to thank Chris Olah and the rest of the Distill editorial team for their valuable feedback and generous editorial support, in addition to supporting the use of their distill.pub technology.

We would like to extend our thanks to Alex Graves, Douglas Eck, Mike Schuster, Rajat Monga, Vincent Vanhoucke, Jeff Dean and the Google Brain team for helpful feedback and for encouraging us to explore this area of research.

Any errors here are our own and do not reflect opinions of our proofreaders and colleagues. If you see mistakes or want to suggest changes, feel free to contribute feedback by participating in the discussion forum for this article.

The experiments in this article were performed on both a P100 GPU and a 64-core CPU Ubuntu Linux virtual machine provided by Google Cloud Platform, using TensorFlow and OpenAI Gym.

For attribution in academic contexts, please cite this work as

Ha and Schmidhuber, "World Models", 2018. https://doi.org/10.5281/zenodo.1207631

BibTeX citation

@article{Ha2018WorldModels, author = {Ha, D. and Schmidhuber, J.}, title = {World Models}, eprint = {arXiv:1803.10122}, doi = {10.5281/zenodo.1207631}, url = {https://worldmodels.github.io}, year = {2018} }

The code to reproduce experiments in this work, as well as IPython notebooks for training and visualizing VAE and MDN-RNN models will be made available at a later date.

Diagrams and text are licensed under Creative Commons Attribution CC-BY 4.0 with the source available on GitHub, unless noted otherwise. The figures that have been reused from other sources don’t fall under this license and can be recognized by the citations in their caption.

In this section we describe in more detail the models and training methods used in this work.

We trained a Convolutional Variational Autoencoder (ConvVAE) model as the V Model of our agent. Unlike vanilla autoencoders, enforcing a Gaussian prior over the latent vector $z$ limits its information capacity for compressing each frame, but this Gaussian prior also makes the world model more robust to unrealistic $z$ vectors generated by the M Model. As the environment may give us observations as high dimensional pixel images, we first resize each image to 64x64 pixels before using this resized image as the V Model’s observation. Each pixel is stored as three floating point values between 0 and 1 to represent each of the RGB channels. The ConvVAE takes in this 64x64x3 input tensor and passes this data through 4 convolutional layers to *encode* it into low dimension vectors $\mu$ and $\sigma$, each of size $N_z$. The latent vector $z$ is sampled from the Gaussian prior $N(\mu, \sigma I)$. The latent vector $z$ is then passed through *deconvolution* layers used to *decode* and reconstruct the image.
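As a minimal sketch of the sampling step described above, assuming the encoder has already produced $\mu$ and $\sigma$, the latent vector can be drawn as $z = \mu + \sigma \odot \epsilon$ with $\epsilon \sim N(0, I)$. The function name `sample_latent` and the latent size `n_z = 32` are illustrative assumptions, not taken from the article's code.

```python
import numpy as np

def sample_latent(mu, sigma, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I) (diagonal Gaussian)."""
    eps = rng.standard_normal(mu.shape)
    return mu + sigma * eps

rng = np.random.default_rng(0)
n_z = 32                      # illustrative latent size (assumption)
mu = np.zeros(n_z)            # stand-in for the encoder's mean output
sigma = np.full(n_z, 0.5)     # stand-in for the encoder's std output
z = sample_latent(mu, sigma, rng)
```

Writing the sample as a deterministic function of $\mu$, $\sigma$ and external noise is what allows gradients to flow through the encoder during training.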

In the following diagram, we describe the shape of our tensor at each layer of the ConvVAE and also describe the details of each layer:

Each convolution and deconvolution layer uses a stride of 2. The layers are indicated in the diagram in *Italics* as *Activation-type Output Channels x Filter Size*. All convolutional and deconvolutional layers use ReLU activations except for the output layer, as we need the output to be between 0 and 1. We trained the model for 1 epoch over the data collected from a random policy, using $L^2$ distance between the input image and the reconstruction to quantify the reconstruction loss we optimize for, in addition to the KL loss.
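The shape arithmetic and the two loss terms can be sketched as follows. This is an illustration under stated assumptions, not the article's implementation: the 4x4 filter size and the absence of padding are assumed (only the stride of 2 is given in the text), the reconstruction term uses the squared $L^2$ distance, and the encoder is assumed to output a log-variance `log_var` rather than $\sigma$ directly.

```python
import numpy as np

def conv_out(size, kernel, stride):
    """Spatial output size of a convolution without padding."""
    return (size - kernel) // stride + 1

# Stride 2 is from the text; kernel size 4 and no padding are assumptions.
size = 64
sizes = []
for _ in range(4):
    size = conv_out(size, kernel=4, stride=2)
    sizes.append(size)

def vae_loss(x, x_hat, mu, log_var):
    """Batch mean of squared-L2 reconstruction loss plus KL(N(mu, sigma^2) || N(0, I))."""
    recon = np.sum((x - x_hat) ** 2, axis=(1, 2, 3))
    kl = -0.5 * np.sum(1.0 + log_var - mu ** 2 - np.exp(log_var), axis=1)
    return float(np.mean(recon + kl))

# A perfect reconstruction with a standard-normal posterior has zero loss.
x = np.full((2, 64, 64, 3), 0.5)
loss = vae_loss(x, x, np.zeros((2, 32)), np.zeros((2, 32)))
```

Under these assumptions the spatial size shrinks from 64 to 31, 14, 6 and finally 2 across the four encoder layers.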

For the M Model, we use an

Unlike the handwriting and sketch generation works, rather than using the MDN-RNN to model the pdf of the next pen stroke, we instead model the pdf of the next latent vector $z$. We sample from this pdf at each timestep to generate the hallucinated environments. In the Doom task, we also use the MDN-RNN to predict the probability of whether the agent has died in this frame. If that probability is above 50%, then we set `done` to be `True` in the virtual dream environment. Given that death is a low probability event at each timestep, we find the cutoff approach to be more stable compared to sampling from the Bernoulli distribution.
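The difference between the two termination rules can be illustrated with a small sketch. The per-step death probability `p` and the function names are hypothetical; the point is only that a low but nonzero probability never triggers the cutoff rule, while Bernoulli sampling will occasionally end the episode.

```python
import numpy as np

def done_by_cutoff(p_death, cutoff=0.5):
    """Rule from the text: done when the predicted death probability exceeds the cutoff."""
    return bool(p_death > cutoff)

def done_by_sampling(p_death, rng):
    """Alternative: sample the done flag from a Bernoulli distribution."""
    return bool(rng.random() < p_death)

rng = np.random.default_rng(0)
p = 0.02        # hypothetical low per-step death probability
steps = 1000
deaths_cutoff = sum(done_by_cutoff(p) for _ in range(steps))
deaths_sampled = sum(done_by_sampling(p, rng) for _ in range(steps))
```

With the cutoff rule the episode only ends when the model is confident the agent has died, which removes one source of randomness from the dream environment.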

The MDN-RNNs were trained for 20 epochs on the data collected from a random policy agent. In the Car Racing task, the LSTM used 256 hidden units, while the Doom task used 512 hidden units. In both tasks, we used 5 Gaussian mixtures and did not model the correlation $\rho$ parameter, hence $z$ is sampled from a factored mixture of Gaussians.
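A factored mixture means that each dimension of $z$ has its own one-dimensional Gaussian mixture and is sampled independently of the others. The sketch below assumes the MDN head outputs arrays `log_pi`, `mu` and `sigma` of shape `(n_z, n_mix)`; these names, the shapes, and `n_z = 32` are illustrative assumptions.

```python
import numpy as np

def sample_factored_gmm(log_pi, mu, sigma, rng):
    """Sample each latent dimension independently from its own 1-D Gaussian mixture.

    log_pi, mu, sigma: arrays of shape (n_z, n_mix).
    """
    # Softmax over mixture components, separately for each dimension.
    pi = np.exp(log_pi - log_pi.max(axis=1, keepdims=True))
    pi /= pi.sum(axis=1, keepdims=True)
    n_z, n_mix = mu.shape
    # Pick one mixture component per dimension.
    idx = np.array([rng.choice(n_mix, p=pi[d]) for d in range(n_z)])
    rows = np.arange(n_z)
    # Sample from the chosen Gaussian in each dimension (no correlation term).
    return mu[rows, idx] + sigma[rows, idx] * rng.standard_normal(n_z)

rng = np.random.default_rng(0)
n_z, n_mix = 32, 5            # 5 mixtures is from the text; n_z is an assumption
z_next = sample_factored_gmm(
    np.zeros((n_z, n_mix)), rng.standard_normal((n_z, n_mix)),
    np.full((n_z, n_mix), 0.1), rng)
```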

When training the MDN-RNN using teacher forcing from the recorded data, we store a pre-computed set of $\mu$ and $\sigma$ for each of the frames, and sample an input $z \sim N(\mu, \sigma)$ each time we construct a training batch, to prevent overfitting our MDN-RNN to a specific sampled $z$.
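The resampling described above can be sketched as a batch constructor. The array layout `(episodes, timesteps, n_z)` and the function name `make_batch` are assumptions for illustration; in practice the recorded actions would be batched alongside the inputs.

```python
import numpy as np

def make_batch(mu_store, sigma_store, batch_indices, rng):
    """Resample z ~ N(mu, sigma) for the chosen episodes each time a batch is built."""
    mu = mu_store[batch_indices]
    sigma = sigma_store[batch_indices]
    z = mu + sigma * rng.standard_normal(mu.shape)
    # Teacher forcing: inputs are z_t, targets are z_{t+1}.
    return z[:, :-1], z[:, 1:]

rng = np.random.default_rng(0)
mu_store = np.zeros((10, 20, 32))      # pre-computed means, one per frame
sigma_store = np.ones((10, 20, 32))    # pre-computed standard deviations
idx = np.array([0, 3, 5])
inputs_a, targets_a = make_batch(mu_store, sigma_store, idx, rng)
inputs_b, targets_b = make_batch(mu_store, sigma_store, idx, rng)
```

Two batches built from the same episodes contain different $z$ samples, so the MDN-RNN never sees exactly the same latent sequence twice.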

For both environments, we applied $\tanh$ nonlinearities to clip and bound the action space to the appropriate ranges. For instance, in the Car Racing task, the steering wheel has a range from -1 to 1, the acceleration pedal from 0 to 1, and the brakes from 0 to 1. In the Doom environment, we converted the discrete actions into a continuous action space between -1 and 1, and divided this range into thirds to indicate whether the agent is moving left, staying where it is, or moving to the right. We would give the C Model a feature vector as its input, consisting of $z$ and the hidden state of the MDN-RNN. In the Car Racing task, this hidden state is the output vector $h$ of the LSTM, while for the Doom task it is both the cell vector $c$ and the output vector $h$ of the LSTM.
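The controller input and the action bounding can be sketched as below. Several details are assumptions made for illustration: the rescaling `(tanh + 1) / 2` for the pedals, the assignment of the lower third to "left", and the latent size of 32. Only the ranges, the split into thirds, and the concatenation of $z$ with the hidden state come from the text.

```python
import numpy as np

def controller(z, h, W, b):
    """Linear controller acting on the concatenated feature vector [z, h]."""
    return W @ np.concatenate([z, h]) + b

def car_racing_action(raw):
    """Bound raw outputs: steering in [-1, 1], acceleration and brake in [0, 1]."""
    t = np.tanh(raw)
    return np.array([t[0], (t[1] + 1.0) / 2.0, (t[2] + 1.0) / 2.0])

def doom_action(raw):
    """Map one tanh-bounded output to left / stay / right by thirds of [-1, 1]."""
    t = np.tanh(raw)
    if t < -1.0 / 3.0:
        return "left"
    if t > 1.0 / 3.0:
        return "right"
    return "stay"

n_z, n_h = 32, 256                 # n_h = 256 is from the text; n_z is an assumption
W = np.zeros((3, n_z + n_h))
b = np.array([100.0, -100.0, 100.0])
action = car_racing_action(controller(np.zeros(n_z), np.zeros(n_h), W, b))
```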

We used *average cumulative reward* of the 16 random rollouts. The diagram below charts the best performer, worst performer, and mean fitness of the population of 64 agents at each generation:

Since the requirement of this environment is to have an agent achieve an average score above 900 over 100 random rollouts, we took the best performing agent at the end of every 25 generations, and tested that agent over 1024 random rollout scenarios to record this average on the red line. After 1800 generations, an agent was able to achieve an average score of 900.46 over 1024 random rollouts. We used 1024 random rollouts rather than 100 because each process of the 64-core machine had been configured to run 16 times already, effectively using a full generation of compute after every 25 generations to evaluate the best agent 1024 times. Below, we plot the results of the same agent evaluated over 100 rollouts:
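To make the evaluation budget described in this section concrete, the sketch below scores a population of 64 candidates by averaging 16 rollouts each. The `toy_rollout` function is a hypothetical stand-in for an environment episode and the candidate parameters are arbitrary; only the population size, the number of rollouts, and the use of the average cumulative reward come from the text.

```python
import numpy as np

def fitness(rollout_fn, params, n_rollouts=16, seed=0):
    """Average cumulative reward over n_rollouts episodes with different seeds."""
    rewards = [rollout_fn(params, seed + i) for i in range(n_rollouts)]
    return float(np.mean(rewards))

def toy_rollout(params, seed):
    """Hypothetical stand-in for one episode: a noisy score of the parameters."""
    rng = np.random.default_rng(seed)
    return -float(np.sum(params ** 2)) + rng.normal(scale=0.1)

population = [np.full(3, 0.1 * i) for i in range(64)]
scores = [fitness(toy_rollout, p) for p in population]
best = int(np.argmax(scores))
evals_per_generation = len(population) * 16
```

One generation therefore costs 64 x 16 = 1024 rollouts, which is the same budget used to evaluate the best agent every 25 generations.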

We also experimented with an agent that has access to only the $z$ vector from the VAE, without letting it see the RNN’s hidden states. We tried 2 variations. In the first variation, the C Model mapped $z$ directly to the action space $a$. In the second variation, we attempted to add a hidden layer with 40 $\tanh$ activations between $z$ and $a$, increasing the number of model parameters of the C Model to 1443, making it more comparable with the original setup.
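The parameter count of 1443 can be checked with a short calculation. The sketch assumes a latent size of 32, which is not stated in this section, and three action outputs (steering, acceleration, brake); with these values the arithmetic reproduces the figure quoted above.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return (n_in + 1) * n_out

# n_z = 32 is an assumption; 40 hidden units is from the text.
n_z, n_hidden, n_actions = 32, 40, 3

direct = dense_params(n_z, n_actions)
with_hidden = dense_params(n_z, n_hidden) + dense_params(n_hidden, n_actions)
```

Here `with_hidden` is (32 + 1) x 40 + (40 + 1) x 3 = 1320 + 123 = 1443.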

We conducted a similar experiment on the hallucinated Doom environment we called *DoomRNN*. Please note that we have not actually attempted to train our agent on the actual VizDoom environment. *DoomRNN* is more computationally efficient compared to VizDoom as it only operates in latent space without the need to render a screenshot at each timestep, and does not require running the actual Doom game engine.

In the virtual DoomRNN environment we constructed, we increased the temperature slightly and used $\tau=1.15$ to make the agent learn in a more challenging environment. The best agent managed to obtain an average score of 959 over 1024 random rollouts (the highest score of the red line in the diagram). This same agent achieved an average score of 1092 $\pm$ 556 over 100 random rollouts when deployed to the actual environment.
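One common way to apply a temperature $\tau$ to a mixture density output is to divide the mixture logits by $\tau$ and scale each standard deviation by $\sqrt{\tau}$. The sketch below assumes this scheme; the text does not spell out how $\tau$ enters the sampling step, so treat the exact form as an assumption.

```python
import numpy as np

def apply_temperature(log_pi, sigma, tau):
    """Return temperature-adjusted mixture weights and standard deviations."""
    scaled = log_pi / tau
    pi = np.exp(scaled - scaled.max(axis=-1, keepdims=True))
    pi /= pi.sum(axis=-1, keepdims=True)
    return pi, sigma * np.sqrt(tau)

log_pi = np.log(np.array([0.7, 0.2, 0.1]))
sigma = np.array([0.5, 0.5, 0.5])
pi_hot, sigma_hot = apply_temperature(log_pi, sigma, tau=1.15)
```

With $\tau > 1$ the mixture weights become flatter and each Gaussian becomes wider, so the dream environment is more uncertain than the model's unmodified prediction.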

- In many RL problems, the feedback (positive or negative reward) is given at the end of a sequence of steps. The credit assignment problem tackles the problem of figuring out which steps caused the resulting feedback: which steps should receive credit or blame for a final result?
- Typical model-free RL models have on the order of $10^3$ to $10^6$ model parameters. We look at training models on the order of $10^7$ parameters, which is still rather small compared to state-of-the-art deep learning models with $10^8$ to even $10^{9}$ parameters. In principle, the procedure described in this article can take advantage of these larger networks if we wanted to use them.
- To be clear, the prediction of $z_{t+1}$ is not fed into the controller C directly — just the hidden state $h_t$ and $z_t$. This is because $h_t$ has all the information needed to generate the parameters of a mixture of Gaussian distribution, if we want to sample $z_{t+1}$ to make a prediction.
- We find this task interesting because although it is not difficult to train an agent to wobble around randomly generated tracks and obtain a mediocre score, CarRacing-v0 defines “solving” as getting average reward of 900 over 100 consecutive trials, which means the agent can only afford very few driving mistakes.
- We will discuss an iterative training procedure later on for more complicated environments where a random policy is not sufficient.
- In principle, we can train both models together in an end-to-end manner, although we found that training each separately is more practical, and also achieves satisfactory results. Training each model required less than an hour of computation time using a single NVIDIA P100 GPU. We can also train individual VAE and MDN-RNN models without having to exhaustively tune hyperparameters.
- We will discuss how this score compares to other models later on.
- In *Learning to Think*, it is acceptable that the RNN M isn’t always a reliable predictor. A (potentially evolution-based) RNN C can in principle learn to ignore a flawed M, or exploit certain useful parts of M for arbitrary computational purposes including hierarchical planning etc. This is not what we do here though; our present approach is still closer to some of the old systems, where an RNN M is used to predict and plan ahead step by step. Unlike this early work, however, we use evolution for C (like in *Learning to Think*) rather than traditional RL combined with RNNs, which has the advantage of both simplicity and generality.
- Another related connection is to muscle memory. For instance, as you learn to do something like play the piano, you no longer have to spend working memory capacity on translating individual notes to finger motions; this all becomes encoded at a subconscious level.

**OpenAI Gym**

Brockman, G., Cheung, V., Pettersson, L., Schneider, J., Schulman, J., Tang, J. and Zaremba, W., 2016. ArXiv preprint.

**Understanding Comics: The Invisible Art**

McCloud, S., 1993. Tundra Publishing.

**More thoughts from Understanding Comics by Scott McCloud**

E, M., 2012. Tumblr.

**Counterintuitive behavior of social systems**

Forrester, J.W., 1971. Technology Review.

**The Code for Facial Identity in the Primate Brain**

Cheang, L. and Tsao, D., 2017. Cell. DOI: 10.1016/j.cell.2017.05.011

**Invariant visual representation by single neurons in the human brain**

Quiroga, R., Reddy, L., Kreiman, G., Koch, C. and Fried, I., 2005. Nature. DOI: 10.1038/nature03687

**Primary Visual Cortex Represents the Difference Between Past and Present**

Nortmann, N., Rekauzke, S., Onat, S., König, P. and Jancke, D., 2015. Cerebral Cortex, Vol 25(6), pp. 1427-1440. DOI: 10.1093/cercor/bht318

**Motion-Dependent Representation of Space in Area MT+**

Gerrit, M., Fischer, J. and Whitney, D., 2013. Neuron. DOI: 10.1016/j.neuron.2013.03.010

**Akiyoshi’s Illusion Pages**

Kitaoka, A., 2002. Kanzen.

**Peripheral drift illusion**

Authors, W., 2017. Wikipedia.

**Illusory Motion Reproduced by Deep Neural Networks Trained for Prediction**

Watanabe, E., Kitaoka, A., Sakamoto, K., Yasugi, M. and Tanaka, K., 2018. Frontiers in Psychology, Vol 9, pp. 345. DOI: 10.3389/fpsyg.2018.00345

**Sensorimotor Mismatch Signals in Primary Visual Cortex of the Behaving Mouse**

Keller, G., Bonhoeffer, T. and Hübener, M., 2012. Neuron, Vol 74(5), pp. 809-815. DOI: 10.1016/j.neuron.2012.03.040

**A Sensorimotor Circuit in Mouse Cortex for Visual Flow Predictions**

Leinweber, M., Ward, D.R., Sobczak, J.M., Attinger, A. and Keller, G.B., 2017. Neuron, Vol 95(6), pp. 1420-1432.e5. DOI: 10.1016/j.neuron.2017.08.036

**The ecology of human fear: survival optimization and the nervous system.**

Mobbs, D., Hagan, C.C., Dalgleish, T., Silston, B. and Prévost, C., 2015. Frontiers in Neuroscience. DOI: 10.3389/fnins.2015.00055

**Baseball Icon Design (CC 3.0)**

Sotil, G., 2018. The Noun Project.

**Tracking Fastballs**

Hirshon, B., 2013. Science Update Interview.

**Reinforcement learning: a survey**

Kaelbling, L.P., Littman, M.L. and Moore, A.W., 1996. Journal of AI research, Vol 4, pp. 237-285.

**Introduction to Reinforcement Learning**

Sutton, R.S. and Barto, A.G., 1998. MIT Press.

**Reinforcement Learning**

Wiering, M. and van Otterlo, M., 2012. Springer.

**Learning How the World Works: Specifications for Predictive Networks in Robots and Brains**

Werbos, P.J., 1987. Proceedings of IEEE International Conference on Systems, Man and Cybernetics, N.Y.

**David Silver’s Lecture on Integrating Learning and Planning**

Silver, D., 2017.

**Making the World Differentiable: On Using Self-Supervised Fully Recurrent Neural Networks for Dynamic Reinforcement Learning and Planning in Non-Stationary Environments**

Schmidhuber, J., 1990.

**An on-line algorithm for dynamic reinforcement learning and planning in reactive environments**

Schmidhuber, J., 1990. 1990 IJCNN International Joint Conference on Neural Networks, pp. 253-258 vol.2. DOI: 10.1109/IJCNN.1990.137723

**Reinforcement Learning in Markovian and Non-Markovian Environments**

Schmidhuber, J., 1991. Advances in Neural Information Processing Systems 3, pp. 500-506. Morgan-Kaufmann.

**The representation of the cumulative rounding error of an algorithm as a Taylor expansion of the local rounding errors**

Linnainmaa, S., 1970.

**Gradient Theory of Optimal Flight Paths**

Kelley, H.J., 1960. ARS Journal, Vol 30(10), pp. 947-954.

**Applications of advances in nonlinear sensitivity analysis**

Werbos, P.J., 1982. System modeling and optimization, pp. 762-770. Springer.

**Deep Reinforcement Learning: A Brief Survey**

Arulkumaran, K., Deisenroth, M.P., Brundage, M. and Bharath, A.A., 2017. IEEE Signal Processing Magazine, Vol 34(6), pp. 26-38. DOI: 10.1109/MSP.2017.2743240

**Deep Learning in Neural Networks: An Overview**

Schmidhuber, J., 2015. Neural Networks, Vol 61, pp. 85-117. DOI: 10.1016/j.neunet.2014.09.003

**A Possibility for Implementing Curiosity and Boredom in Model-building Neural Controllers**

Schmidhuber, J., 1990. Proceedings of the First International Conference on Simulation of Adaptive Behavior on From Animals to Animats, pp. 222-227. MIT Press.

**On Learning to Think: Algorithmic Information Theory for Novel Combinations of Reinforcement Learning Controllers and Recurrent Neural World Models**

Schmidhuber, J., 2015. ArXiv preprint.

**Auto-Encoding Variational Bayes**

Kingma, D. and Welling, M., 2013. ArXiv preprint.

**Stochastic Backpropagation and Approximate Inference in Deep Generative Models**

Jimenez Rezende, D., Mohamed, S. and Wierstra, D., 2014. ArXiv preprint.

**ViZDoom: A Doom-based AI Research Platform for Visual Reinforcement Learning**

Kempka, M., Wydmuch, M., Runc, G., Toczek, J. and Jaskowski, W., 2016. IEEE Conference on Computational Intelligence and Games, pp. 341-348. IEEE.

**DoomTakeCover-v0**

Paquette, P., 2016.

**A Neural Representation of Sketch Drawings**

Ha, D. and Eck, D., 2017. ArXiv preprint.

**Draw Together with a Neural Network**

Ha, D., Jongejan, J. and Johnson, I., 2017. Google AI Experiments.

**Mixture density networks**

Bishop, C.M., 1994. Technical Report. Aston University.

**Mixture Density Networks with TensorFlow**

Ha, D., 2015. blog.otoro.net.

**Generating sequences with recurrent neural networks**

Graves, A., 2013. ArXiv preprint.

**Recurrent Neural Network Tutorial for Artists**

Ha, D., 2017. blog.otoro.net.

**Experiments in Handwriting with a Neural Network**

Carter, S., Ha, D., Johnson, I. and Olah, C., 2016. Distill. DOI: 10.23915/distill.00004

**Evolutionsstrategie: optimierung technischer systeme nach prinzipien der biologischen evolution**

Rechenberg, I., 1973. Frommann-Holzboog.

**Numerical Optimization of Computer Models**

Schwefel, H., 1977. John Wiley and Sons, Inc.

**A Visual Guide to Evolution Strategies**

Ha, D., 2017. blog.otoro.net.

**The CMA Evolution Strategy: A Tutorial**

Hansen, N., 2016. ArXiv preprint.

**Completely Derandomized Self-Adaptation in Evolution Strategies**

Hansen, N. and Ostermeier, A., 2001. Evolutionary Computation, Vol 9(2), pp. 159-195. MIT Press. DOI: 10.1162/106365601750190398

**CarRacing-v0**

Klimov, O., 2016.

**Self-driving cars in the browser**

Hünermann, J., 2017.

**Mar I/O Kart**

Bling, S., 2015.

**Using Keras and Deep Deterministic Policy Gradient to play TORCS**

Lau, B., 2016.

**Car Racing using Reinforcement Learning**

Khan, M. and Elibol, O., 2016.

**Reinforcement Car Racing with A3C**

Jang, S., Min, J. and Lee, C., 2017.

**Deep-Q Learning for Box2D Racecar RL problem.**

Prieur, L., 2017. GitHub.

**Video Game Exploits**

Wikipedia, A., 2017. Wikipedia.

**Action-Conditional Video Prediction using Deep Networks in Atari Games**

Oh, J., Guo, X., Lee, H., Lewis, R. and Singh, S., 2015. ArXiv preprint.

**Recurrent Environment Simulators**

Chiappa, S., Racaniere, S., Wierstra, D. and Mohamed, S., 2017. ArXiv preprint.

**PILCO: A Model-Based and Data-Efficient Approach to Policy Search**

Deisenroth, M. and Rasmussen, C., 2011. In Proceedings of the International Conference on Machine Learning.

**Neural Network Dynamics for Model-Based Deep Reinforcement Learning with Model-Free Fine-Tuning**

Nagabandi, A., Kahn, G., Fearing, R. and Levine, S., 2017. ArXiv preprint.

**Formal Theory of Creativity, Fun, and Intrinsic Motivation (1990-2010).**

Schmidhuber, J., 2010. IEEE Trans. Autonomous Mental Development.

**Developmental Robotics, Optimal Artificial Curiosity, Creativity, Music, and the Fine Arts**

Schmidhuber, J., 2006. Connection Science, Vol 18(2), pp. 173-187.

**Curious Model-Building Control Systems**

Schmidhuber, J., 1991. In Proc. International Joint Conference on Neural Networks, Singapore, pp. 1458-1463. IEEE.

**Curiosity-driven Exploration by Self-supervised Prediction**

Pathak, D., Agrawal, P., A., E. and Darrell, T., 2017. ArXiv preprint.

**Intrinsic Motivation Systems for Autonomous Mental Development**

Oudeyer, P., Kaplan, F. and Hafner, V., 2007. Trans. Evol. Comp. IEEE Press. DOI: 10.1109/TEVC.2006.890271

**Reinforcement driven information acquisition in nondeterministic environments**

Schmidhuber, J., Storck, J. and Hochreiter, S., 1994.

**Information-seeking, curiosity, and attention: computational and neural mechanisms**

Gottlieb, J., Oudeyer, P., Lopes, M. and Baranes, A., 2013. Cell. DOI: 10.1016/j.tics.2013.09.001

**Abandoning objectives: Evolution through the search for novelty alone**

Lehman, J. and Stanley, K., 2011. Evolutionary Computation, Vol 19(2), pp. 189-223. MIT Press.

**Memory Consolidation**

Authors, W., 2017. Wikipedia.

**Replay Comes of Age**

Foster, D.J., 2017. Annual Review of Neuroscience, Vol 40(1), pp. 581-602. DOI: 10.1146/annurev-neuro-072116-031538

**Intrinsic Motivation and Automatic Curricula via Asymmetric Self-Play**

Sukhbaatar, S., Lin, Z., Kostrikov, I., Synnaeve, G., Szlam, A. and Fergus, R., 2017. ArXiv preprint.

**Emergent Complexity via Multi-Agent Competition**

Bansal, T., Pachocki, J., Sidor, S., Sutskever, I. and Mordatch, I., 2017. ArXiv preprint.

**Continuous Adaptation via Meta-Learning in Nonstationary and Competitive Environments**

Al-Shedivat, M., Bansal, T., Burda, Y., Sutskever, I., Mordatch, I. and Abbeel, P., 2017. ArXiv preprint.

**PowerPlay: Training an Increasingly General Problem Solver by Continually Searching for the Simplest Still Unsolvable Problem**

Schmidhuber, J., 2013. Frontiers in Psychology, Vol 4, pp. 313. DOI: 10.3389/fpsyg.2013.00313

**First Experiments with PowerPlay**

Srivastava, R., Steunebrink, B. and Schmidhuber, J., 2012. ArXiv preprint.

**Optimal Ordered Problem Solver**

Schmidhuber, J., 2002. ArXiv preprint.

**A Dual Back-Propagation Scheme for Scalar Reinforcement Learning**

Munro, P.W., 1987. Proceedings of the Ninth Annual Conference of the Cognitive Science Society, Seattle, WA, pp. 165-176.

**Dynamic Reinforcement Driven Error Propagation Networks with Application to Game Playing**

Robinson, T. and Fallside, F., 1989. CogSci 89.

**Neural Networks for Control and System Identification**

Werbos, P.J., 1989. Proceedings of IEEE/CDC Tampa, Florida.

**The truck backer-upper: An example of self learning in neural networks**

Nguyen, N. and Widrow, B., 1989. Proceedings of the International Joint Conference on Neural Networks, pp. 357-363. IEEE Press.

**Lecture Slides on PILCO**

Duvenaud, D., 2016. CSC 2541 Course at University of Toronto.

**Data-Efficient Reinforcement Learning in Continuous-State POMDPs**

McAllister, R. and Rasmussen, C., 2016. ArXiv preprint.

**Improving PILCO with Bayesian Neural Network Dynamics Models**

Gal, Y., McAllister, R. and Rasmussen, C., 2016. ICML Workshop on Data-Efficient Machine Learning.

**Learning and Policy Search in Stochastic Dynamical Systems with Bayesian Neural Networks**

Depeweg, S., Hernandez-Lobato, J., Doshi-Velez, F. and Udluft, S., 2016. ArXiv preprint.

**A Benchmark Environment Motivated by Industrial Control Problems**

Hein, D., Depeweg, S., Tokic, M., Udluft, S., Hentschel, A., Runkler, T. and Sterzing, V., 2017. ArXiv preprint.

**Learning to Generate Artificial Fovea Trajectories for Target Detection**

Schmidhuber, J. and Huber, R., 1991. International Journal of Neural Systems, Vol 2(1-2), pp. 125-134. DOI: 10.1142/S012906579100011X

**Learning deep dynamical models from image pixels**

Wahlström, N., Schön, T. and Deisenroth, M., 2014. ArXiv preprint.

**From Pixels to Torques: Policy Learning with Deep Dynamical Models**

Wahlström, N., Schön, T. and Deisenroth, M., 2015. ArXiv preprint.

**Deep Spatial Autoencoders for Visuomotor Learning**

Finn, C., Tan, X., Duan, Y., Darrell, T., Levine, S. and Abbeel, P., 2015. ArXiv preprint.

**Embed to Control: A Locally Linear Latent Dynamics Model for Control from Raw Images**

Watter, M., Springenberg, J., Boedecker, J. and Riedmiller, M., 2015. ArXiv preprint.

**Model-Based RL Lecture at Deep RL Bootcamp 2017**

Finn, C., 2017.

**Game Engine Learning from Video**

Matthew Guzdial, B.L., 2017. Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI-17, pp. 3707-3713. DOI: 10.24963/ijcai.2017/518

**Learning to Act by Predicting the Future**

Dosovitskiy, A. and Koltun, V., 2016. ArXiv preprint.

**Hallucination with Recurrent Neural Networks**

Graves, A., 2015.

**Unsupervised Learning of Disentangled Representations from Video**

Denton, E. and Birodkar, V., 2017. ArXiv preprint.

**The Predictron: End-To-End Learning and Planning**

Silver, D., van Hasselt, H., Hessel, M., Schaul, T., Guez, A., Harley, T., Dulac-Arnold, G., Reichert, D., Rabinowitz, N., Barreto, A. and Degris, T., 2016. ArXiv preprint.

**Imagination-Augmented Agents for Deep Reinforcement Learning**

Weber, T., Racanière, S., Reichert, D., Buesing, L., Guez, A., Rezende, D., Badia, A., Vinyals, O., Heess, N., Li, Y., Pascanu, R., Battaglia, P., Silver, D. and Wierstra, D., 2017. ArXiv preprint.

**Visual Interaction Networks**

Watters, N., Tacchetti, A., Weber, T., Pascanu, R., Battaglia, P. and Zoran, D., 2017. ArXiv preprint.

**PathNet: Evolution Channels Gradient Descent in Super Neural Networks**

Fernando, C., Banarse, D., Blundell, C., Zwols, Y., Ha, D., Rusu, A., Pritzel, A. and Wierstra, D., 2017. ArXiv preprint.

**Evolution Strategies as a Scalable Alternative to Reinforcement Learning**

Salimans, T., Ho, J., Chen, X., Sidor, S. and Sutskever, I., 2017. ArXiv preprint.

**Evolving Stable Strategies**

Ha, D., 2017. blog.otoro.net.

**Welcoming the Era of Deep Neuroevolution**

Stanley, K. and Clune, J., 2017. Uber AI Research.

**Playing Atari with Deep Reinforcement Learning**

Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D. and Riedmiller, M., 2013. ArXiv preprint.

**Evolving Neural Networks Through Augmenting Topologies**

Stanley, K.O. and Miikkulainen, R., 2002. Evolutionary Computation, Vol 10(2), pp. 99-127.

**Accelerated Neural Evolution Through Cooperatively Coevolved Synapses**

Gomez, F., Schmidhuber, J. and Miikkulainen, R., 2008. Journal of Machine Learning Research, Vol 9, pp. 937-965. JMLR.org.

**Co-evolving Recurrent Neurons Learn Deep Memory POMDPs**

Gomez, F. and Schmidhuber, J., 2005. Proceedings of the 7th Annual Conference on Genetic and Evolutionary Computation, pp. 491-498. ACM. DOI: 10.1145/1068009.1068092

**Autonomous Evolution of Topographic Regularities in Artificial Neural Networks**

Gauci, J. and Stanley, K.O., 2010. Neural Computation, Vol 22(7), pp. 1860-1898. MIT Press. DOI: 10.1162/neco.2010.06-09-1042

**Parameter-exploring policy gradients**

Sehnke, F., Osendorfer, C., Ruckstieb, T., Graves, A., Peters, J. and Schmidhuber, J., 2010. Neural Networks, Vol 23(4), pp. 551-559. DOI: 10.1016/j.neunet.2009.12.004

**Evolving Neural Networks**

Miikkulainen, R., 2013. IJCNN.

**Evolving Large-scale Neural Networks for Vision-based Reinforcement Learning**

Koutnik, J., Cuccu, G., Schmidhuber, J. and Gomez, F., 2013. Proceedings of the 15th Annual Conference on Genetic and Evolutionary Computation, pp. 1061-1068. ACM. DOI: 10.1145/2463372.2463509

**A Neuroevolution Approach to General Atari Game Playing**

Hausknecht, M., Lehman, J., Miikkulainen, R. and Stone, P., 2013. IEEE Transactions on Computational Intelligence and AI in Games.

**Neuro-Visual Control in the Quake II Environment**

Parker, M. and Bryant, B., 2012. IEEE Transactions on Computational Intelligence and AI in Games.

**Autoencoder-augmented Neuroevolution for Visual Doom Playing**

Alvernaz, S. and Togelius, J., 2017. ArXiv preprint.

**Cortical interneurons that specialize in disinhibitory control**

Pi, H., Hangya, B., Kvitsiani, D., Sanders, J., Huang, Z. and Kepecs, A., 2013. Nature. DOI: 10.1038/nature12676

**SE3-Pose-Nets: Structured Deep Dynamics Models for Visuomotor Planning and Control**

Byravan, A., Leeb, F., Meier, F. and Fox, D., 2017. ArXiv preprint.

**Long short-term memory**

Hochreiter, S. and Schmidhuber, J., 1997. Neural Computation. MIT Press.

**Learning to Forget: Continual Prediction with LSTM**

Gers, F., Schmidhuber, J. and Cummins, F., 2000. Neural Computation, Vol 12(10), pp. 2451-2471. MIT Press. DOI: 10.1162/089976600300015015

**Nanoconnectomic upper bound on the variability of synaptic plasticity**

Bartol, T.M., Bromer, C., Kinney, J., Chirillo, M.A., Bourne, J.N., Harris, K.M. and Sejnowski, T.J., 2015. eLife Sciences Publications, Ltd. DOI: 10.7554/eLife.10778

**Connectionist models of recognition memory: constraints imposed by learning and forgetting functions.**

Ratcliff, R.M., 1990. Psychological review, Vol 97(2), pp. 285-308.

**Catastrophic interference in connectionist networks: Can It Be predicted, can It be prevented?**

French, R.M., 1994. Advances in Neural Information Processing Systems 6, pp. 1176-1177. Morgan-Kaufmann.

**Overcoming catastrophic forgetting in neural networks**

Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A.M., Quan, J., Ramalho, T., Grabska-Barwinska, A., Hassabis, D., Clopath, C., Kumaran, D. and Hadsell, R., 2016. ArXiv preprint.

**Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer**

Shazeer, N., Mirhoseini, A., Maziarz, K., Davis, A., Le, Q., Hinton, G. and Dean, J., 2017. ArXiv preprint.

**HyperNetworks**

Ha, D., Dai, A. and Le, Q., 2016. ArXiv preprint.

**Language Modeling with Recurrent Highway Hypernetworks**

Suarez, J., 2017. Advances in Neural Information Processing Systems 30, pp. 3269-3278. Curran Associates, Inc.

**WaveNet: A Generative Model for Raw Audio**

van den Oord, A., Dieleman, S., Zen, H., Simonyan, K., Vinyals, O., Graves, A., Kalchbrenner, N., Senior, A. and Kavukcuoglu, K., 2016. ArXiv preprint.

**Attention Is All You Need**

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A., Kaiser, L. and Polosukhin, I., 2017. ArXiv preprint.

**Generative Temporal Models with Memory**

Gemici, M., Hung, C., Santoro, A., Wayne, G., Mohamed, S., Rezende, D., Amos, D. and Lillicrap, T., 2017. ArXiv preprint.

**One Big Net For Everything**

Schmidhuber, J., 2018. Preprint arXiv:1802.08864 [cs.AI].

**Learning Complex, Extended Sequences Using the Principle of History Compression**

Schmidhuber, J., 1992. Neural Computation, Vol 4(2), pp. 234-242.