**Making Deep Q-learning methods robust to time discretization** [Website]

- Corentin Tallec · Leonard Blier · Yann Ollivier
- Universit´e Paris-Sud, Facebook AI Research
- Notes
- Problem: Deep Reinforcement Learning is not robust to hyperparameterization, implementation details, or small environment changes.
- In this paper, we identify sensitivity to
`time discretization`

in`near continuous-time environments`

as a critical factor; this covers, e.g.,`changing the number of frames per second`

, or`the action frequency of the controller`

.

**Nonlinear Distributional Gradient Temporal-Difference Learning** [Website]

- chao qu · Shie Mannor · Huan Xu
- Ant Financial Services Group
- Notes
- We devise a
`distributional`

variant of gradient temporal-difference (TD) learning. - 分布式强化学习

- We devise a

**TibGM: A Transferable and Information-Based Graphical Model Approach for Reinforcement Learning** [Website]

- Tameem Adel · Adrian Weller
- Machine Learning Group, University of Cambridge
- Notes
- Problem: Hierarchical reinforcement learning (HRL) can provide a principled solution to the RL challenge of scalability for complex tasks. By incorporating a graphical model (GM) and the rich family of related methods, there is also hope to address issues such as
`transferability, generalisation and exploration`

. - We propose a flexible GM-based HRL framework which leverages efficient inference procedures to enhance generalisation and transfer power.

- Problem: Hierarchical reinforcement learning (HRL) can provide a principled solution to the RL challenge of scalability for complex tasks. By incorporating a graphical model (GM) and the rich family of related methods, there is also hope to address issues such as

**Multi-Agent Adversarial Inverse Reinforcement Learning** [Website]

- Lantao Yu · Jiaming Song · Stefano Ermon
- Stanford University
- Notes
- Problem: Inverse reinforcement learning provides a framework to automatically acquire suitable reward functions from expert demonstrations. Its extension to multi-agent settings, however, is difficult due to the more complex notions of rational behaviors.
- In this paper, we propose MA-AIRL, a new framework for multi-agent inverse reinforcement learning, which is effective and scalable for Markov games with high-dimensional state-action space and unknown dynamics.

**Off-Policy Deep Reinforcement Learning without Exploration** [Website]

- Scott Fujimoto · David Meger · Doina Precup
- McGill University
- Github: github.com/sfujim/BCQ
- Notes
- Problem: Many practical applications of reinforcement learning constrain agents to learn from
`a fixed batch of data which has already been gathered`

, without offering further possibility for data collection. - In this paper, we demonstrate that due to errors introduced by extrapolation, standard off-policy deep reinforcement learning algorithms, such as DQN and DDPG, are incapable of learning with
`data uncorrelated to the distribution under the current policy`

, making them ineffective for this fixed batch setting. - We introduce a novel class of off-policy algorithms, batch-constrained reinforcement learning, which restricts the action space in order to force the agent towards behaving close to on-policy with respect to a subset of the given data.

- Problem: Many practical applications of reinforcement learning constrain agents to learn from

**Random Expert Distillation: Imitation Learning via Expert Policy Support Estimation** [Website]

- Ruohan Wang · Carlo Ciliberto · Pierluigi Vito Amadori · Yiannis Demiris
- Imperial College London
- Github: github.com/RuohanW/RED
- Notes:
- Problem:
- Imitation learning from a finite set of expert trajectories, without access to reinforcement signals
- The classical approach of extracting the expert’s reward function via inverse reinforcement learning, followed by reinforcement learning is indirect and may be computationally expensive.

`Random Expert Distillation`

is a new framework for imitation learning, using the estimated support of the expert policy as reward.

- Problem:

**Revisiting the Softmax Bellman Operator: New Benefits and New Perspective** [Website]

- Zhao Song · Ron Parr · Lawrence Carin
- Baidu Research, Duke University
- Github: github.com/zhao-song/Softmax-DQN
- Notes
- Problem: The impact of
`softmax on the value function`

itself in reinforcement learning (RL) is often viewed as problematic because it leads to sub-optimal value (or Q) functions and interferes with the contraction properties of the`Bellman operator`

. - 探究Q-Learning中贝尔曼方程的最大化操作带来的影响

- Problem: The impact of

**Distributional Reinforcement Learning for Efficient Exploration** [Website]

- Borislav Mavrin · Hengshuai Yao · Linglong Kong · Kaiwen Wu · Yaoliang Yu
- University of Alberta, Huawei Technologies, University of Waterloo
- Notes
- Problem: In distributional reinforcement learning (RL), the estimated distribution of value functions model both the
`parametric and intrinsic uncertainties`

. - We propose a novel and efficient exploration method for deep RL that has two components.
- The first is a decaying schedule to suppress the intrinsic uncertainty.
- The second is an exploration bonus calculated from the upper quantiles of the learned distribution.

- A particularly interesting application of the (Distributional) RL approach is driving safety.

- Problem: In distributional reinforcement learning (RL), the estimated distribution of value functions model both the

**Learning Action Representations for Reinforcement Learning** [Website]

- Yash Chandak · Georgios Theocharous · James Kostas · Scott Jordan · Philip Thomas
- University of Massachusetts Amherst,
`Adobe Research`

- Notes
- Problem: Most model-free reinforcement learning methods leverage state representations (embeddings) for generalization, but either ignore structure in the space of actions or assume the structure is provided a priori.
- We show how a policy can be decomposed into a component that acts in a low-dimensional space of action representations and a component that transforms these representations into actual actions.
- Key Insights
- Actions are not independent discrete quantities.
- There is a low dimensional structure underlying their behavior pattern.
- This structure can be learned independent of the reward.
- Instead of raw actions, agent can act in this space of behavior and feedback can be generalized to similar actions.

- Algorithm
- (a) Supervised learning of action representations.
- (b) Learning internal policy with policy gradients.

**Control Regularization for Reduced Variance Reinforcement Learning** [Website]

- Richard Cheng · Abhinav Verma · Gabor Orosz · Swarat Chaudhuri · Yisong Yue · Joel Burdick
- California Institute of Technology, Rice University, University of Michigan, Caltech
- Github: github.com/rcheng805/CORE-RL
- Notes
- Problem: Dealing with
`high variance`

is a significant challenge in model-free reinforcement learning (RL). Existing methods are unreliable, exhibiting high variance in performance from run to run using different initializations/seeds. - Focusing on problems arising in continuous control, we propose a functional regularization approach to augmenting model-free RL.
- Control Regularization helps by providing:
- Reduced variance
- Higher rewards
- Faster learning
- Potential safety guarantees

- Problem: Dealing with

**On the Generalization Gap in Reparameterizable Reinforcement Learning** [Website]

- Huan Wang · Stephan Zheng · Caiming Xiong · Richard Socher
- Salesforce Research
- Notes
- Problem: Understanding generalization in reinforcement learning (RL) is a significant challenge, as many common assumptions of traditional supervised learning theory do not apply.
- We argue that the gap between training and testing performance of RL agents is caused by two types of errors: intrinsic error due to the randomness of the environment and an agent’s policy, and external error by the change of environment distribution.

**Remember and Forget for Experience Replay** [Website]

- Guido Novati · Petros Koumoutsakos
- ETH Zurich
- Github: github.com/cselab/smarties
- Notes
- Problem: ER recalls experiences from past iterations to compute gradient estimates for the current policy, increasing data-efficiency. However, the accuracy of such updates may deteriorate when the policy diverges from past behaviors and can undermine the performance of ER.
- An alternative is to actively
`manage the experiences in the replay memory`

. We introduce Remember and Forget Experience Replay (ReF-ER), a novel method that can enhance RL algorithms with parameterized policies. - ReF-ER
- (1) skips gradients computed from experiences that are too unlikely with the current policy
- (2) regulates policy changes within a trust region of the replayed behaviors.

- We couple ReF-ER with Q-learning, deterministic policy gradient and off-policy gradient methods.

**Maximum Entropy-Regularized Multi-Goal Reinforcement Learning** [Website]

- Rui Zhao · Xudong Sun · Volker Tresp
- Siemens AG, Ludwig Maximilian University of Munich
- Github: github.com/ruizhaogit/mep
- Notes
- Problem: In Multi-Goal Reinforcement Learning, an agent learns to achieve multiple goals with a goal-conditioned policy. During learning, the agent first collects the trajectories into a replay buffer and later these trajectories are selected randomly for replay. However, the achieved goals in the replay buffer are often biased towards the behavior policies.
- We want to encourage the agent to achieve a diverse set of goals while maximizing the expected return.
- Contributions
- First, we propose a novel multi-goal RL objective based on weighted entropy, which is essentially a reward-weighted entropy objective.
- Secondly, we derive a safe surrogate objective, that is, a lower bound of the original objective, to achieve stable optimization.
- Thirdly, we developed a Maximum Entropy-based Prioritization (MEP) framework to optimize the derived surrogate objective.

**Imitating Latent Policies from Observation** [Website]

- Ashley Edwards · Himanshu Sahni · Yannick Schroecker · Charles Isbell
- Georgia Institute of Technology
- Github: github.com/ashedwards/ILPO
- Notes
- Problem: Imitation learning
- Motivation
- Imitation from Observation enables learning from state sequences
- Typical approaches need extensive environment interactions
- Humans can learn policies just by watching

- In this paper, we describe a novel approach to imitation learning that infers latent policies directly from state observations.

**Dimension-Wise Importance Sampling Weight Clipping for Sample-Efficient Reinforcement Learning** [Website]

- Seungyul Han · Youngchul Sung
- KAIST (Korea)
- Notes
- Problem: Limitations of PPO
- PPO has vanishing gradient problem in high dimensional tasks.
- On-policy learning of PPO is sample-inefficient.

- To overcome these drawbacks, we propose
`Dimension-wise importance sampling weight clipping (DISC)`

: Solve the vanishing gradient problem.`Off-policy generalization`

: Reuse old samples to enhance the sample-efficiency

- Conclusion
- DISC extends PPO by dimension-wise IS clipping and off-policy generalization.
- DISC solves the vanishing gradient problem and enhances sample-efficiency.
- DISC achieves top-level performance as compared to other state-of-the-art RL algorithms.

- Problem: Limitations of PPO

**Learning Novel Policies For Tasks** [Website]

- Yunbo Zhang · Wenhao Yu · Greg Turk
- Georgia Institute of Technology
- Notes
- Problem: Finding multiple distinct solutions for a particular task is a challenging problem for reinforcement learning algorithms.
- Motivation
- Want more than one solution (i.e. novel solutions) to a problem.
- E.g. Different Locomotion styles for legged robots.

- Want more than one solution (i.e. novel solutions) to a problem.
- Key Aspects
- Novelty measurement function
- Measures the novelty of a trajectory compared with trajectories from other policies

- Policy Gradient Update
- Make sure final gradient compromises between task and novelty
- Task-Novelty Bisector (TNB)

- Novelty measurement function
- Method Overview
- Define a separate novelty reward function apart from task reward.
- Train a policy using Task-Novelty Bisector (TNB) to balance the optimization of task and novelty.
- Update novelty measurement function.
- Repeat

**Taming MAML: Efficient unbiased meta-reinforcement learning** [Website]

- Hao Liu · Richard Socher · Caiming Xiong
- Salesforce Research, UC Berkeley
- Github: github.com/lhao499/taming-maml
- Notes
- Problem: While meta reinforcement learning (Meta-RL) methods have achieved remarkable success, obtaining correct and low variance estimates for policy gradients remains a significant challenge.
- Computational Efficient Solution: TMAML
- Idea: surrogate function + scalable control variates

- TMAML reduced meta-gradient variance and improve performance
- MAML (Finn et al 2017) is biased
- DICE (Foerster et al 2018) is unbiased & high variance
- LVC (Rothfuss et al 2019) is biased & low Variance
- TMAML is unbiased & low variance

**Self-Supervised Exploration via Disagreement** [Website]

- Deepak Pathak · Dhiraj Gandhi · Abhinav Gupta
- UC Berkeley, Carnegie Mellon University
- Github: github.com/pathak22/exploration-by-disagreement
- Notes
- Problem:
`Exploration`

has been a long standing problem in both model-based and model-free learning methods for sensorimotor control. There have been major advances in recent years demonstrated in`noise-free, non-stochastic`

domains such as video games and simulation. However, most of the current formulations get stuck when there are`stochastic dynamics`

. - In this paper, we propose a formulation for exploration inspired from the work in active learning literature.
- Question: Where is
`"Disagreement"`

?

- Problem:

**The Natural Language of Actions** [Website]

- Guy Tennenholtz · Shie Mannor
- Technion Institute of Technology
- Notes
- We introduce Act2Vec, a general framework for learning
`context-based action representation`

for Reinforcement Learning. - Act2Vec in Reinforcement Learning
- Representation
- State-Action Value Approximation
- Exploration
- Domain Transfer

- Conclusion
- Actions are thought of as symbols of a natural, expressive language
- Actions represent thoughts, beliefs, and strategies
- Their meaning is captured through their context
- Similarity between actions helps reduce cardinality of action space
- Prior knowledge is incorporated through action representations

- We introduce Act2Vec, a general framework for learning

**Batch Policy Learning under Constraints** [Website]

- Hoang Le · Cameron Voloshin · Yisong Yue
- Caltech
- Notes
- Problem: When learning policies for real-world domains, two important questions arise
- (1) how to efficiently use existing off-line, off-policy, non-optimal behavior data
- (2) how to mediate among different competing objectives and constraints.

- We study the problem of batch policy learning under multiple constraints and offer a systematic solution.

- Problem: When learning policies for real-world domains, two important questions arise

**Calibrated Model-Based Deep Reinforcement Learning** [Website]

- Ali Malik · Volodymyr Kuleshov · Jiaming Song · Danny Nemer · Harlan Seymour · Stefano Ermon
- Stanford Universtiy, Afresh Technologies
- Notes
- Problem:
`Accurate estimates of predictive uncertainty`

are important for building effective model-based reinforcement learning agents. However, predictive uncertainties — especially ones derived from modern neural networks — are often inaccurate and impose a bottleneck on performance. - Key question:
- Which uncertainties are important in Model-Based Reinforcement Learning?

- What constitutes good probabilistic forecasts?
**Sharpness.**Predictive distributions should be focused i.e have low variance**Calibration.**Uncertainty should be empirically accurate i.e. true value should fall in a p% confidence interval p% of the time

- Problem:

**Finding Options that Minimize Planning Time** [Website]

- Yuu Jinnai · David Abel · David Hershkowitz · Michael L. Littman · George Konidaris
- Brown University, Carnegie Mellon University
- Notes
- Problem: Selecting the optimal set of options for planning
- Contributions
**Formally define**the problem of finding an optimal set of options for planning- The complexity of computing an optimal set of options is
**NP-hard** **Approximation algorithm**for computing optimal options (under conditions)**Experimental evaluation**to compare with existing heuristic algorithms

**Reinforcement Learning in Configurable Continuous Environments** [Website]

- Alberto Maria Metelli · Emanuele Ghelfi · Marcello Restelli
- Politecnico di Milano
- Github: github.com/albertometelli/remps
- Notes
- Problem: Configurable Markov Decision Processes (Conf-MDPs) have been recently introduced as an extension of the usual MDP model to account for the possibility of configuring the environment to improve the agent’s performance.
`Currently, there is still no suitable algorithm to solve the learning problem for real-world Conf-MDPs.`

- Problem: Configurable Markov Decision Processes (Conf-MDPs) have been recently introduced as an extension of the usual MDP model to account for the possibility of configuring the environment to improve the agent’s performance.

**Projections for Approximate Policy Iteration Algorithms** [Website]

- Riad Akrour · Joni Pajarinen · Jan Peters · Gerhard Neumann
- TU Darmstadt
- Notes
- Problem: Approximate policy iterations is a class of reinforcement learning algorithms where both the value function and policy are encoded using function approximators and which has been especially prominent in continuous action spaces. However, by encoding the policy with a function approximator, it often becomes necessary to constrain the change in action distribution during policy update to ensure increase of the policy return.
- Several approximations exist in the literature to solve this constrained policy update problem. In this paper, we propose to
`improve`

over such solutions by introducing a set of projections that transform the constrained problem into an unconstrained one which is then solved by standard gradient descent.

**Learning from a Learner** [Website]

- alexis jacq · Matthieu Geist · Ana Paiva · Olivier Pietquin
- Google Research, University of Lisbon
- Notes
- In this paper, we propose a novel setting for
`Inverse Reinforcement Learning (IRL)`

, namely`"Learning from a Learner" (LfL)`

. As opposed to standard IRL, it does not consist in learning a reward by observing an optimal agent but from observations of another learning (and thus sub-optimal) agent. - To recover that reward in practice, we propose methods based on the
`entropy-regularized policy iteration framework`

.

- In this paper, we propose a novel setting for

**DeepMDP: Learning Continuous Latent Space Models for Representation Learning** [Website]

- Carles Gelada · Saurabh Kumar · Jacob Buckman · Ofir Nachum · Marc Bellemare
- Google Brain, Johns Hopkins University
- Notes
- Problem: Many reinforcement learning tasks provide the agent with
`high-dimensional observations`

that can be simplified into`low-dimensional continuous states`

.

- Problem: Many reinforcement learning tasks provide the agent with

**Predictor-Corrector Policy Optimization** [Website]

- Ching-An Cheng · Xinyan Yan · Nathan Ratliff · Byron Boots
- Georgia Tech, NVIDIA
- Notes
- We present a predictor-corrector framework, called
`PicCoLO`

, that can transform a`first-order model-free reinforcement`

or`imitation learning`

algorithm into a new hybrid method that`leverages predictive models`

to`accelerate policy learning`

- Main idea- We should not fully trust a model (e.g. the methods in the biased world) but leverage only the correct part

- How?
- Frame policy optimization as predictable online learning (POL)
- Design a reduction-based algorithm for POL to reuse known algorithms
- When translated back, this gives a meta-algorithm for policy optimization

- We present a predictor-corrector framework, called

**Action Robust Reinforcement Learning and Applications in Continuous Control** [Website]

- Chen Tessler · Chen Tessler · Yonathan Efroni · Yonathan Efroni · Shie Mannor · Shie Mannor
- Technion (以色列理工学院)
- Notes
- Problem: A policy is said to be
`robust`

if it maximizes the reward while considering a bad, or even adversarial, model. - In this work we formalize two new criteria of robustness to action uncertainty.
- Specifically, we consider two scenarios in which the agent attempts to perform an action, and (i) with probability α, an alternative adversarial action is taken, or (ii) an adversary adds a perturbation to the selected action in the case of continuous action space.

- Problem: A policy is said to be