[2022-10-14] Factored Adaptation for Non-stationary RL

Factored Adaptation for Non-Stationary Reinforcement Learning

A framework that learns a factored representation to adapt to non-stationarity.
We formalize a unified framework that can handle different non-stationary settings, including discrete and continuous changes, both within and across episodes.

Background

This work extends the factored representation for fast policy adaptation across domains introduced in AdaRL (ICLR 2022). Suppose there are $n$ source domains and $n^{'}$ target domains. The generative process of the environment in the $k$-th domain, with $k = 1, \dots, n + n^{'}$, can be described in terms of the transition function as

$a_{t - 1}$ affects only some dimensions of $s_{t}$; there are also structural relationships between the dimensions of $s_{t - 1}$ and those of $s_{t}$.
$θ_{k}^{s}, θ_{k}^{r}$ are the change factors that have a constant value in each domain, but vary across domains.

The $j$-th dimension of $c_{i}^{s \to s} \in \{0, 1\}^{d}$ is $1$ $\iff$ $s_{j, t}$ influences $s_{i, t + 1}$
$c_{i}^{a \to s} \in \{0, 1\}^{d}$ is $1$ $\iff$ $a_{t}$ influences $s_{i, t + 1}$
$c_{i}^{θ_{k} \to s} \in \{0, 1\}^{p}$ encodes which of the $p$ components of the change factor $θ_{k}^{s} = (θ_{1, k}^{s}, \dots, θ_{p, k}^{s})$ affect $s_{i, t + 1}$
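
The role of the three masks can be illustrated with a minimal sketch. It is not the AdaRL implementation: the per-dimension functions `f[i]`, the helper names, and the toy numbers are all hypothetical, and the action mask is assumed to be one 0/1 flag per state dimension. Each next-state dimension only sees the inputs that its masks let through.

```python
def apply_mask(values, bits):
    """Element-wise product of a value vector with a binary mask."""
    if len(values) != len(bits):
        raise ValueError("mask and vector lengths differ")
    return [v * b for v, b in zip(values, bits)]


def factored_step(s, a, theta, c_ss, c_as, c_theta, f):
    """Compute the next state one dimension at a time.

    c_ss[i]    : 0/1 mask over the d state dimensions (state -> s_i)
    c_as[i]    : 0/1 flag, 1 if the action influences s_i (assumption)
    c_theta[i] : 0/1 mask over the p change-factor components
    f[i]       : hypothetical per-dimension transition function
    """
    next_s = []
    for i in range(len(s)):
        s_in = apply_mask(s, c_ss[i])
        a_in = [x * c_as[i] for x in a]
        theta_in = apply_mask(theta, c_theta[i])
        next_s.append(f[i](s_in, a_in, theta_in))
    return next_s


def toy_f(s_in, a_in, theta_in):
    # Toy dynamics: just add up whatever survives the masks.
    return sum(s_in) + sum(a_in) + sum(theta_in)


# d = 2 state dimensions, p = 1 change-factor component.
c_ss = [[1, 0], [1, 1]]   # s_0 depends on s_0 only; s_1 on both
c_as = [1, 0]             # the action influences s_0 only
c_theta = [[0], [1]]      # the change factor influences s_1 only

next_state = factored_step(
    [1.0, 2.0], [0.5], [10.0], c_ss, c_as, c_theta, [toy_f, toy_f]
)
```

In this toy setup, changing the change factor moves only the second state dimension, which is the kind of structure the masks are meant to expose.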

The optimal policy across domains is then learned using these compact representations: $a_{t} = π^{*} (s_{t}^{min}, θ_{k}^{min})$.

Factored Non-stationary MDPs

Continuous changes: If $g^{s}$ and $g^{r}$ are continuous, then they can model smooth changes in the environment, including within and across episodes.
Discrete changes: can be represented with a piecewise-constant function.
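
The two kinds of change can be contrasted with a small sketch. The sinusoid, the breakpoints, and the function names are illustrative assumptions, not the paper's $g^{s}$ or $g^{r}$. The change factor is treated as a function of a global timestep, so a change may fall inside an episode or exactly on an episode boundary.

```python
import bisect
import math


def global_time(episode, step, horizon):
    """Map (episode, step) to one global timestep (assumes fixed horizon)."""
    return episode * horizon + step


def continuous_theta(t, period=50.0):
    """Hypothetical smooth change factor: a sinusoid in global time."""
    return math.sin(2.0 * math.pi * t / period)


def piecewise_constant_theta(t, breakpoints, values):
    """Change factor that jumps at each breakpoint and is constant between.

    breakpoints must be sorted ascending; values needs one more entry
    than breakpoints (one value per segment).
    """
    if len(values) != len(breakpoints) + 1:
        raise ValueError("need len(values) == len(breakpoints) + 1")
    return values[bisect.bisect_right(breakpoints, t)]


# Across-episode change: breakpoints on episode boundaries (horizon 10).
across = [
    piecewise_constant_theta(global_time(ep, 0, 10), [10, 20], [0.0, 1.0, -1.0])
    for ep in range(3)
]

# Within-episode change: a breakpoint in the middle of episode 0.
within = [
    piecewise_constant_theta(global_time(0, step, 10), [5], [0.0, 1.0])
    for step in range(10)
]
```

Placing breakpoints at multiples of the horizon gives purely across-episode changes, while any other breakpoint produces a change within an episode.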

Sparsity loss

We encourage sparsity in the binary masks $C$ to improve identifiability, by using the following loss with

The total objective function:

$L_{vae} = k_{1} L_{rec} + k_{2} L_{pred} - k_{3} L_{KL} - k_{4} L_{sparse} - k_{5} L_{smooth}$
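
As a minimal sketch of how the terms combine, the function below takes already-computed scalar losses and applies the signs from the objective above. The function name and default weights are hypothetical. Giving every term its own weight is an assumption of this sketch, and the individual loss computations are deliberately left out.

```python
def total_objective(l_rec, l_pred, l_kl, l_sparse, l_smooth,
                    k=(1.0, 1.0, 1.0, 1.0, 1.0)):
    """Weighted combination of the five terms.

    Reconstruction and prediction terms enter with a plus sign; the KL,
    sparsity, and smoothness terms are subtracted as penalties.
    The weights k are hypothetical hyperparameters.
    """
    k_rec, k_pred, k_kl, k_sparse, k_smooth = k
    return (k_rec * l_rec + k_pred * l_pred
            - k_kl * l_kl - k_sparse * l_sparse - k_smooth * l_smooth)


# Toy numbers only, to show the sign convention.
value = total_objective(2.0, 1.0, 0.5, 0.25, 0.25,
                        k=(1.0, 1.0, 2.0, 4.0, 4.0))
```

With this sign convention the objective is something to maximize; a training loop that minimizes would use its negation.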

Experiments

Baselines

meta-RL: TRIO, VariBAD
representative task-embedding approaches: LILAC, ZeUS
stationary: SAC, oracle

Summary

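The paper proposes a "change factor" and tries to model this latent structure to address non-stationarity.
Claim: the framework can handle different non-stationary settings. In practice there is a fairly significant limitation: the change in the dynamics must be a function of time $t$ and task index $i$.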
