[2022-11-18] RL with Causal Reasoning

Generalizing Goal-Conditioned Reinforcement Learning with Variational Causal Reasoning

Preliminary

$M = (S, A, P, R, G)$
$G \subset S$ is the goal space
$r(s, g) = \mathbb{1}(s = g) \in R$ is the sparse deterministic reward function

Different tasks are generated by sampling goals from different distributions, $p_{\text{train}}(g)$ and $p_{\text{test}}(g)$, which is how the goal-conditioned generalization problem is studied.
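A minimal sketch of the setup above, assuming a small discrete goal space. The names `goal_space`, `p_train`, and `p_test` and their values are hypothetical, chosen only to show a train/test goal shift with disjoint supports.

```python
import random

def sparse_reward(state, goal):
    """r(s, g) = 1(s = g): 1.0 only when the state equals the goal."""
    return 1.0 if state == goal else 0.0

def sample_goal(goal_space, weights, rng):
    """Draw one goal from a categorical distribution over the goal space."""
    return rng.choices(goal_space, weights=weights, k=1)[0]

# Hypothetical 1-D goal space: goals are integer cell indices.
goal_space = [0, 1, 2, 3, 4, 5]
p_train = [1, 1, 1, 0, 0, 0]  # training goals come from cells 0-2 only
p_test = [0, 0, 0, 1, 1, 1]   # test goals come from cells 3-5 only

rng = random.Random(0)
train_goals = [sample_goal(goal_space, p_train, rng) for _ in range(5)]
test_goals = [sample_goal(goal_space, p_test, rng) for _ in range(5)]
```

Because the two supports do not overlap here, a policy evaluated on `test_goals` never sees a goal it was trained on, which is the extreme case of the generalization problem.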

Causal Reasoning with Graphical Models

random variables $X = (X_{1}, \dots, X_{d})$ with index set $V := \{1, \dots, d\}$
A graph $G = (V, E)$ consists of nodes $V$ and edges $E \subset V^{2}$
A node $i$ is called a parent of $j$ if $e_{ij} \in E$ and $e_{ji} \notin E$. The set of parents of $j$ is denoted by $PA_{j}^{G}$.
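The parent definition can be checked on a small example. This is an illustrative sketch: the helper `parents` and the edge set `E` are hypothetical, with edges stored as ordered pairs `(i, j)` standing for $e_{ij}$.

```python
def parents(edges, j):
    """Return PA_j: every node i with (i, j) in E and (j, i) not in E."""
    return {i for (i, k) in edges if k == j and (j, i) not in edges}

# Hypothetical graph on V = {1, 2, 3, 4}: 1 -> 3, 2 -> 3, and 3 <-> 4.
E = {(1, 3), (2, 3), (3, 4), (4, 3)}
pa_3 = parents(E, 3)  # {1, 2}: node 4 is excluded because e_34 is also in E
pa_4 = parents(E, 4)  # empty: the only incoming edge is bidirected
```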

GCRL as Latent Variable Models

From the probabilistic inference perspective, the goal is to solve the likelihood maximization problem for $p(\tau \mid s^{*})$ with $s^{*} := \mathbb{1}(g = s^{T})$. Treating the graph $G$ as a latent variable, $p(\tau \mid s^{*})$ can be decomposed to obtain an ELBO.
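For orientation, the generic form of such a bound is given here. It is the standard variational identity with $G$ as the latent variable, not necessarily the exact decomposition used in the paper. The prior $p(G \mid s^{*})$ is an assumption of this sketch; it reduces to $p(G)$ if the graph is a priori independent of $s^{*}$.

$$
\log p(\tau \mid s^{*})
= \log \sum_{G} q_{\phi}(G \mid \tau) \frac{p(\tau, G \mid s^{*})}{q_{\phi}(G \mid \tau)}
\ge \mathbb{E}_{q_{\phi}(G \mid \tau)}\left[\log p(\tau \mid G, s^{*})\right]
- D_{\mathrm{KL}}\left(q_{\phi}(G \mid \tau) \,\|\, p(G \mid s^{*})\right)
$$

The inequality is Jensen's inequality applied to the logarithm; the gap is the KL divergence between $q_{\phi}(G \mid \tau)$ and the true posterior over $G$.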

$\log p(s^{0})$ and $\log p(g)$ are constants (uniform distributions), so maximizing $p(\tau \mid s^{*})$ can be converted into an objective over $\theta$ and $\phi$.

Intuition: solving the optimization problem above requires alternating updates of $\phi$ (causal discovery) and $\theta$ (model and policy learning).

Model learning

We propose to model the transition corresponding to $G$ with a collection of neural networks $f_{\theta}(G) := \{f_{\theta_{j}}\}_{j = 1}^{M}$ to obtain

$s_{j}^{t + 1} = f_{\theta_{j}}([PA_{j}^{G}]^{t}, U_{j})$

$[PA_{j}^{G}]^{t}$ represents the values of all parents of node $s_{j}^{t}$ at time step $t$
$U_{j}$ is Gaussian noise, $U_{j} \sim \mathcal{N}(0, I)$
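A minimal sketch of the factored transition, assuming linear functions as stand-ins for the networks $f_{\theta_{j}}$ and additive noise. The helpers `make_linear_f` and `step`, the `noise_scale` parameter, and the toy system are all hypothetical; the point is only that each next-state dimension reads its own parents and nothing else.

```python
import random

def make_linear_f(weights):
    """Stand-in for one network f_{theta_j}: weighted sum of parent values plus noise."""
    def f(parent_values, noise):
        return sum(w * v for w, v in zip(weights, parent_values)) + noise
    return f

def step(x_t, parent_sets, fs, rng, noise_scale=1.0):
    """Compute every s_j^{t+1} from the values of its parents at time t only."""
    s_next = []
    for pa, f in zip(parent_sets, fs):
        parent_values = [x_t[i] for i in pa]
        u = rng.gauss(0.0, 1.0) * noise_scale  # U_j ~ N(0, 1) when noise_scale is 1
        s_next.append(f(parent_values, u))
    return s_next

# Hypothetical system: x_t = [s_0, s_1, a]; each parent set indexes into x_t.
parent_sets = [[0, 2], [1]]  # s_0' depends on (s_0, a); s_1' depends on s_1 only
fs = [make_linear_f([1.0, 1.0]), make_linear_f([0.5])]

rng = random.Random(0)
x_t = [2.0, 4.0, 1.0]
s_next = step(x_t, parent_sets, fs, rng, noise_scale=0.0)  # noise switched off
```

With the noise switched off, changing `x_t[1]` leaves the first output untouched, because `s_1` is not a parent of `s_0'` in this toy graph.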

Policy learning with planning

$Q(s^{t}, a^{t}) = \mathbb{E}\left[\sum_{t' = 0}^{H} \gamma^{t'} r(s^{t' + t}, a^{t' + t}) \mid s^{t}, a^{t}\right]$
MPC (random shooting): $\hat{\pi}(s^{t}) = \arg\max_{a^{t} \in A} Q_{\theta}^{G}(s^{t}, a^{t})$
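Random shooting can be sketched as follows: sample action sequences, score each by its discounted return under the model, and execute the first action of the best one. The deterministic toy `model`, the `reward`, and all parameter values below are hypothetical stand-ins, not the learned causal model.

```python
import random

def rollout_return(state, actions, model, reward, gamma):
    """Discounted return of one action sequence under the given model."""
    total, s = 0.0, state
    for k, a in enumerate(actions):
        total += (gamma ** k) * reward(s, a)
        s = model(s, a)
    return total

def random_shooting(state, action_space, model, reward, horizon, n_samples, gamma, rng):
    """Return the first action of the best of n_samples random action sequences."""
    best_action, best_value = None, float("-inf")
    for _ in range(n_samples):
        # horizon + 1 actions, matching the sum over t' = 0..H in Q
        actions = [rng.choice(action_space) for _ in range(horizon + 1)]
        value = rollout_return(state, actions, model, reward, gamma)
        if value > best_value:
            best_action, best_value = actions[0], value
    return best_action

# Hypothetical toy task: walk on the integer line toward goal = 3.
GOAL = 3

def model(s, a):
    return s + a  # deterministic stand-in for the learned transition

def reward(s, a):
    return 1.0 if s == GOAL else 0.0  # sparse: paid only while at the goal

action = random_shooting(2, [-1, 1], model, reward, horizon=1,
                         n_samples=200, gamma=0.9, rng=random.Random(0))
```

Starting from state 2, only sequences whose first action is `+1` reach the goal within the horizon, so the planner should select `+1`.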

Data-Efficient Causal Discovery

Causal discovery then simplifies to $\max_{\phi} J(\theta, \phi)$

Restrict the posterior $q_{\phi}(G \mid \tau)$ to a point-mass distribution and use a threshold $\eta$ to control the sparsity.
Perform the discovery process from the classification perspective by proposing binary classifiers $q_{\phi}(e_{ij} \mid \tau, \eta)$ to determine the existence of an edge $e_{ij}$.
$\eta$ is the threshold for the p-value of the hypothesis. A larger $\eta$ corresponds to harder sparsity constraints, leading to a sparse $G$ since two nodes are more likely to be considered independent.

According to Definition 3, we only need to classify the edges connecting nodes in $U$ to nodes in $V$. If two nodes are dependent, we add one edge directed from the node in $U$ to the node in $V$.
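The edge-classification idea can be sketched with a chi-square independence test on binary variables. Several things here are assumptions: the variables are binary, $U$ holds variables at time $t$ and $V$ holds variables at time $t + 1$, and the cutoff `significance` follows the conventional rule "reject independence when the p-value is below the cutoff". How $\eta$ maps onto that cutoff is not fixed by this sketch, and all names and data are hypothetical.

```python
import math

def chi2_p_value_2x2(xs, ys):
    """p-value of Pearson's chi-square independence test for two binary sequences."""
    n = len(xs)
    counts = [[0, 0], [0, 0]]
    for x, y in zip(xs, ys):
        counts[x][y] += 1
    row = [sum(counts[0]), sum(counts[1])]
    col = [counts[0][0] + counts[1][0], counts[0][1] + counts[1][1]]
    if 0 in row or 0 in col:
        return 1.0  # a constant variable gives no evidence of dependence
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row[i] * col[j] / n
            stat += (counts[i][j] - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 degree of freedom.
    return math.erfc(math.sqrt(stat / 2.0))

def discover_edges(u_data, v_data, significance):
    """Add a directed edge (u, v) whenever independence of u and v is rejected."""
    edges = set()
    for u_name, u_vals in u_data.items():
        for v_name, v_vals in v_data.items():
            if chi2_p_value_2x2(u_vals, v_vals) < significance:
                edges.add((u_name, v_name))
    return edges

# Hypothetical data: s0_next copies the action, s1_next ignores it.
a = [0, 1] * 20
u_data = {"a": a}
v_data = {"s0_next": list(a), "s1_next": [0, 0, 1, 1] * 10}
edges = discover_edges(u_data, v_data, significance=0.05)
```

Only pairs across the two node sets are tested, so the number of tests is $|U| \cdot |V|$ rather than quadratic in the total number of nodes.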

Analysis of Performance Guarantee

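The better the causal graph, the better the model learning.

The better the model learning, the closer the value function is to optimal.

Controlling the bound requires a better policy (hence model learning and policy learning need to alternate).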

Experiments


Summary & Thoughts

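Improves generality by learning a causal transition model
Could combining this with causality theory bring better interpretability?
It estimates the causal graph explicitly, which makes training difficult
Offline performance is poor, but the offline setting is more practical. Optimize the offline case?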

$s_{j}^{t + 1} = f_{\theta_{j}}([PA_{j}^{G}]^{t}, U_{j})$

$P(s' \mid s, a; z)$ and $\pi(a \mid s; z)$

$z \in Z \doteq \{z_{1}, z_{2}, \dots, z_{n}\}$

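Problem: there is no relation $p(z_{t + 1} \mid z_{t})$, so how can $z$ be learned?

If $p(z_{t + 1} \mid z_{t})$ does not exist or is unknown, $z$ cannot be obtained by explicit prediction
If $z$ is encoded implicitly, there is no major difference from other methods
Is extra information needed to obtain $z$ without relying on $p(z_{t + 1} \mid z_{t})$? E.g., add an assumption: $z$ is associated with the current information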

$P(s' \mid s, a; G[(s, a, \dots) \to z])$

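However, could $G[(s_{t}, a_{t}, \dots) \to z_{t}]$ and $G[(s_{t + 1}, a_{t + 1}, \dots) \to z_{t + 1}]$ become indirectly linked through the relation $(s_{t + 1} \mid s_{t}, a_{t})$, inducing a structural association $(z_{t + 1} \mid z_{t})$?
Run some verification experiments
Consider other angles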

