### 1. UC Berkeley

Sergey Levine and Chelsea Finn Research Group

**Online Meta-Learning** [Website]

- Chelsea Finn · Aravind Rajeswaran · Sham Kakade · Sergey Levine
- UC Berkeley
- Notes
- This work introduces an online meta-learning problem setting, which merges ideas from both the aforementioned paradigms in order to better capture the spirit and practice of continual lifelong learning.
- FTML: practical instantiation of our approach, extending MAML meta-train on all data so far, fine-tune on current task

**Efficient Off-Policy Meta-Reinforcement Learning via Probabilistic Context Variables** [Website]

- Kate Rakelly · Aurick Zhou · Chelsea Finn · Sergey Levine · Deirdre Quillen
- UC Berkeley
- Algorithm: PEARL
- Github: github.com/katerakelly/oyster
- Poster: [PDF]
- Notes
- Problem: Current meta-RL methods rely heavily on on-policy experience, limiting their sample efficiency, and lack mechanisms to reason about
`task uncertainty`

when identifying and learning new tasks, limiting their effectiveness in sparse reward problems. - In this paper, we aim to address these challenges by developing an off-policy meta-RL algorithm based on
`online latent task inference`

. - Our method can be interpreted as an implementation of online probabilistic filtering of latent task variables to infer how to solve a new task from small amounts of experience.
- Method features
- Disentangle task inference from control
- Off-Policy Meta-Training
- Efficient exploration by posterior sampling
- First off-policy meta-RL algorithm
- 20-100X improved sample efficiency on the domains tested, often substantially better final returns
- Probabilistic belief over the task enables posterior sampling for efficient exploration

- Problem: Current meta-RL methods rely heavily on on-policy experience, limiting their sample efficiency, and lack mechanisms to reason about

**SOLAR: Deep Structured Representations for Model-Based Reinforcement Learning** [Website]

- Marvin Zhang · Sharad Vikram · Laura Smith · Pieter Abbeel · Matthew Johnson · Sergey Levine
- UC Berkeley, UC San Diego, Google
- Github: github.com/sharadmv/parasol
- Blog post: https://bair.berkeley.edu/blog/2019/05/20/solar/
- Notes
- Problem: Model-based reinforcement learning (RL) has proven to be a data efficient approach for learning control tasks but is difficult to utilize in domains with complex observations such as images.
- In this paper, we present a method for learning representations that are suitable for iterative model-based policy improvement, in that these representations are optimized for inferring simple dynamics and cost models given the data from the current policy.
- In this work, we enable
`LQR-FLM`

for images using`structured representation learning`

. `Key idea:`

structured representation learning to enable accurate modeling with simple models; model-based method that does not use forward prediction

**Learning a Prior over Intent via Meta-Inverse Reinforcement Learning** [Website]

- Kelvin Xu · Ellis Ratner · EECS Anca Dragan · Sergey Levine · Chelsea Finn
- UC Berkeley
- Notes
- Problem: In practice, IRL must commonly be performed with only a limited set of demonstrations where it can be exceedingly difficult to unambiguously recover a reward function.
- In this work, we exploit the insight that demonstrations from other tasks can be used to constrain the set of possible reward functions by learning a ‘‘prior’’ that is specifically optimized for the ability to infer expressive reward functions from limited numbers of demonstrations.
- Motivation: a well specified reward function remains an important assumption for applying RL in practice
- Often easier to provide expert data and learn a reward function using inverse RL
- Inverse RL frequently requires a lot of data to learn a generalizable reward
- This is due in part with the fundamental ambiguity of reward learning

- Goal: how can agents infer rewards from one or a few demonstrations?
- Intuition: demonstrations from previous tasks induce a prior over the space of possible future tasks

- Meta-inverse reinforcement learning: using prior tasks information to accelerate inverse-RL
- Our approach: Meta reward and intention learning
- Experiments
- Domain 1: SpriteWorld environment
- Domain 2: First person navigation (SUNCG)

### 2. DeepMind

**Composing Entropic Policies using Divergence Correction** [Website]

- Jonathan Hunt · Andre Barreto · Timothy Lillicrap · Nicolas Heess
- DeepMind
- Notes
- Problem: Deep reinforcement learning algorithms have achieved remarkable successes, but often require vast amounts of experience to solve a task.
`Composing skills mastered in one task`

in order to efficiently solve novel challenges promises dramatic improvements in data efficiency. - Here, we build on two recent works composing behaviors represented in the form of action-value functions.
- 组合不同策略

- Problem: Deep reinforcement learning algorithms have achieved remarkable successes, but often require vast amounts of experience to solve a task.

**Learning Latent Dynamics for Planning from Pixels** [Website]

- Danijar Hafner · Timothy Lillicrap · Ian Fischer · Ruben Villegas · David Ha · Honglak Lee · James Davidson
- DeepMind
- Algorithm: PlaNet
- Github: github.com/google-research/planet
- Blog post: https://ai.googleblog.com/2019/02/introducing-planet-deep-planning.html
- Notes
- Problem: Learning dynamics models that are accurate enough for planning has been a long-standing challenge, especially in image-based domains.
- We propose the Deep Planning Network (PlaNet), a purely model-based agent that learns the environment dynamics from images and chooses actions through fast online planning in latent space.
- Method features
- Recipe for scalable model-based reinforcement learning
- Efficient planning in latent space with large batch size
- Reaches top performance using 200X fewer episodes

**Policy Consolidation for Continual Reinforcement Learning** [Website]

- Christos Kaplanis · Murray Shanahan · Claudia Clopath
- Imperial College London, DeepMind
- Notes
- Problem: Catastrophic forgetting in deep reinforcement learning
- We propose a method for tackling catastrophic forgetting in deep reinforcement learning that is agnostic to the timescale of changes in the distribution of experiences, does not require knowledge of task boundaries and can adapt in continuously changing environments.
- Future work
- Prioritised consolidation
- Adapt for off-policy learning

### 3. OpenAI

**Quantifying Generalization in Reinforcement Learning** [Website]

- Karl Cobbe · Oleg Klimov · Chris Hesse · Taehoon Kim · John Schulman
- OpenAI
- Github: github.com/openai/coinrun
- Blog post: https://openai.com/blog/quantifying-generalization-in-reinforcement-learning/
- Notes
- Problem: Overfitting in deep reinforcement learning
- Many common deep RL benchmarks ignore generalization
`CoinRun`

: a`procedurally generated`

environment with distinct train/test sets- Deep architectures, dropout, L2 regularization improve generalization
- Better benchmarks lead to better architectures and algorithms