Generalizing Goal-Conditioned Reinforcement Learning with Variational Causal Reasoning
Preliminary
- is the goal space
- is the sparse deterministic reward function
Different tasks are generated by sampling goals from two different goal distributions, which makes it possible to study the goal-conditioned generalization problem.
Causal Reasoning with Graphical Models
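- random variables $X = (X_1, \dots, X_d)$ with index set $V := \{1, \dots, d\}$
- A graph $G = (V, E)$ consists of nodes $V$ and edges $E \subseteq V^2$
- A node $i$ is called a parent of $j$ if $(i, j) \in E$ and $(j, i) \notin E$. The set of parents of $j$ is denoted by $\mathbf{PA}_j^G$.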
GCRL as Latent Variable Models
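From the probabilistic inference perspective, the goal is to solve a likelihood maximization problem for with . Treating the graph as a latent variable, can be decomposed to obtain the ELBO:
 and are constants (uniform distributions), so maximizing can be converted into the objective:
Intuition: solving the optimization problem above requires alternating between two updates, (causal discovery) and (model and policy learning).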
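For reference, the generic form of an ELBO with a latent graph can be written as follows. This is only a sketch in assumed notation, not necessarily the paper's exact equation: $\tau$ stands for the observed trajectory data, $p(G)$ for a prior over graphs, and $q(G)$ for a variational distribution over graphs. The inequality is Jensen's inequality.

```latex
\begin{aligned}
\log p(\tau)
  &= \log \sum_{G} p(\tau \mid G)\, p(G) \\
  &= \log \mathbb{E}_{q(G)}\!\left[ \frac{p(\tau \mid G)\, p(G)}{q(G)} \right] \\
  &\ge \mathbb{E}_{q(G)}\!\left[ \log p(\tau \mid G) \right]
     - \mathrm{KL}\!\left( q(G) \,\|\, p(G) \right)
\end{aligned}
```

If the prior $p(G)$ is uniform over a finite set of graphs, then $\log p(G)$ is a constant, so the KL term equals the negative entropy of $q(G)$ plus a constant and the prior drops out of the optimization.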
Model learning
We propose to model the transition corresponding to G with a collection of neural networks to obtain
- represents the values of all parents of node at time step
- follows Gaussian noise
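The idea of one predictor per node, each seeing only its parents, can be sketched as below. This is a minimal illustration under stated assumptions: linear least-squares models stand in for the neural networks, and the names `fit_causal_model`, `predict`, and the boolean matrix `adj` (where `adj[i, j]` means input `i` is a parent of next-state node `j`) are hypothetical. Under additive Gaussian noise, least squares is the maximum-likelihood fit for the mean.

```python
import numpy as np

def fit_causal_model(inputs, next_states, adj):
    """Fit one linear model per next-state node, using only that node's parents."""
    models = []
    for j in range(next_states.shape[1]):
        parents = np.flatnonzero(adj[:, j])          # indices of the parents of node j
        X = np.column_stack([inputs[:, parents], np.ones(len(inputs))])  # add a bias column
        w, *_ = np.linalg.lstsq(X, next_states[:, j], rcond=None)
        models.append((parents, w))
    return models

def predict(models, x):
    """Predict the mean of every next-state node from a single input vector x."""
    return np.array([np.append(x[parents], 1.0) @ w for parents, w in models])

# Toy data (assumption): node 0 depends on input 0, node 1 on inputs 1 and 2.
rng = np.random.default_rng(0)
inputs = rng.normal(size=(500, 3))
next_states = np.column_stack([2.0 * inputs[:, 0], inputs[:, 1] - inputs[:, 2]])
next_states += 0.01 * rng.normal(size=next_states.shape)  # Gaussian noise
adj = np.array([[True, False], [False, True], [False, True]])

models = fit_causal_model(inputs, next_states, adj)
x = np.array([1.0, 0.5, -0.5])
pred = predict(models, x)
```

Because each model only receives its parents, changing a non-parent input cannot change that node's prediction, which is the property that is expected to help under distribution shift.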
Policy learning with planning
- MPC (random shooting):
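Random-shooting MPC can be sketched as follows. Everything here is an illustrative assumption rather than the paper's setup: the toy dynamics `toy_dynamics`, the uniform action range, and the dense negative-distance score, which is used in place of the sparse reward so that the toy example has a useful planning signal.

```python
import numpy as np

def random_shooting(state, goal, dynamics, action_dim, horizon=5, n_candidates=256, rng=None):
    """Return the first action of the best randomly sampled action sequence."""
    rng = np.random.default_rng() if rng is None else rng
    # Sample candidate action sequences uniformly in [-1, 1].
    actions = rng.uniform(-1.0, 1.0, size=(n_candidates, horizon, action_dim))
    states = np.repeat(state[None, :], n_candidates, axis=0)
    returns = np.zeros(n_candidates)
    for t in range(horizon):
        states = dynamics(states, actions[:, t])                # roll the model forward
        returns -= np.linalg.norm(states - goal, axis=1)        # assumed dense score
    best = np.argmax(returns)
    return actions[best, 0]  # execute only the first action, then replan

def toy_dynamics(states, actions):
    """Hypothetical learned model: the state moves by half the action."""
    return states + 0.5 * actions

rng = np.random.default_rng(0)
state = np.zeros(2)
goal = np.array([1.0, -1.0])
for _ in range(10):
    action = random_shooting(state, goal, toy_dynamics, action_dim=2, rng=rng)
    state = toy_dynamics(state, action)
```

Replanning at every step is what makes this MPC: only the first action of the best sequence is executed, so model errors do not accumulate over the full horizon.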
Data-Efficient Causal Discovery
Causal discovery is then simplified:
- restrict the posterior to point mass distribution and use a threshold to control the sparsity.
- perform the discovery process from the classification perspective by proposing binary classifiers to determine the existence of an edge .
- is the threshold for the p-value of the hypothesis. A larger corresponds to harder sparsity constraints, leading to a sparse since two nodes are more likely to be considered independent.
According to Definition 3, classification is only needed for edges that connect nodes across the two node sets. If two such nodes are dependent, we add one directed edge from the node in the first set to the node in the second set.
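Edge-wise discovery across the two node sets can be sketched as a loop of independent binary decisions. Note the swapped-in technique: the absolute Pearson correlation is used here as a simple dependence score in place of the statistical hypothesis test described above, and the score itself (not a p-value) is thresholded. The function name `discover_edges` and the `threshold` parameter are hypothetical.

```python
import numpy as np

def discover_edges(inputs, next_states, threshold=0.3):
    """Return adj[i, j] = True if input node i is judged dependent with next-state node j."""
    n_in, n_out = inputs.shape[1], next_states.shape[1]
    adj = np.zeros((n_in, n_out), dtype=bool)
    for i in range(n_in):
        for j in range(n_out):
            # Dependence score: absolute correlation (assumed stand-in for the test).
            score = abs(np.corrcoef(inputs[:, i], next_states[:, j])[0, 1])
            adj[i, j] = score > threshold  # one binary decision per candidate edge
    return adj

# Toy data (assumption): node 0 depends on input 0, node 1 on inputs 1 and 2.
rng = np.random.default_rng(0)
inputs = rng.normal(size=(2000, 3))
noise = 0.1 * rng.normal(size=(2000, 2))
next_states = np.stack(
    [inputs[:, 0] + noise[:, 0], inputs[:, 1] - inputs[:, 2] + noise[:, 1]], axis=1
)
adj = discover_edges(inputs, next_states)
```

In this sketch a larger `threshold` keeps fewer edges, so it plays the role of the sparsity control. Only edges from the first node set to the second are ever considered, so the result is a bipartite graph and no acyclicity check is needed.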
Analysis of Performance Guarantee
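- The better the causal graph, the better the model learning.
- The better the model learning, the closer the value function is to the optimal one.
- Controlling the bound requires a better policy, which is why model learning and policy learning need to alternate.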
Experiments
Summary & Thoughts
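- Generality is improved by learning a causal transition model.
- Could combining the method with causality theory bring better interpretability?
- The causal graph is estimated explicitly, which makes training difficult.
- The offline setting performs worse but is more practical. Could the offline setting be improved?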
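Problem: if the relation does not exist, how to learn ?
- If does not exist or is unknown, cannot be obtained by explicit prediction.
- If is encoded implicitly, there is no major difference from other methods.
- Is extra information needed to obtain without relying on ? E.g., add an assumption: is correlated with the current information.
- But would and become structurally related indirectly, through their relation to ?
- Run some verification experiments.
- Consider other perspectives.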