Background

  • Language allows us to encode abstractions, to generalize, and to communicate plans, intentions, and requirements. These are fundamentally desirable capabilities for artificial agents.
  • However, agents trained with RL and IL typically lack such capabilities and struggle to learn efficiently from interactions with rich and diverse environments.

Using Natural Language for Reward Shaping in Reinforcement Learning (IJCAI 2019)

(figure)


  • A common approach to reduce interaction time with the environment is to use reward shaping.
  • However, designing appropriate shaping rewards is known to be difficult as well as time-consuming.
  • In this work, we address this problem by using natural language instructions to perform reward shaping.

(figure)


  • If the agent is given a positive reward only when it reaches the end of the desired trajectory, it may need to spend a significant amount of time exploring the environment to learn that behavior.
  • Giving the agent intermediate rewards for progress towards the goal can help (reward shaping). However, designing intermediate rewards is hard, particularly for non-experts.
  • Since natural language instructions can be provided easily, even by non-experts, they offer a convenient way to teach RL agents new skills.

Notations

  • Consider an extension of the MDP framework, defined by the tuple $\langle S, A, T, R, \gamma, l \rangle$.
  • $l \in L$ is a language command describing the intended behavior (with $L$ defined as the set of all possible language commands).
  • We denote this language-augmented MDP as MDP+L.

LanguagE-Action Reward Network (LEARN)

  • Sample two distinct timesteps $t_1$ and $t_2$ ($t_1 < t_2$) from the set $\{1, \dots, T\}$, where $T$ is the length of a trajectory $\tau$. Let $\tau_{t_1:t_2}$ denote the segment of $\tau$ between timesteps $t_1$ and $t_2$.
  • Create an action-frequency vector $f$ from the actions in $\tau_{t_1:t_2}$. Create a dataset of $(f, l)$ pairs from a given set of $(\tau, l)$ pairs; a code sketch of this construction follows this list.
  • Positive examples: created by sampling $\tau_{t_1:t_2}$ from a given trajectory $\tau$ and using the language description $l$ associated with $\tau$.
  • Negative examples: created by (1) sampling $\tau_{t_1:t_2}$ from $\tau$ but pairing it with an alternate description $l'$ sampled uniformly at random from the data excluding $l$, or (2) creating a random action-frequency vector and pairing it with $l$.
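
A minimal sketch of this dataset construction, assuming trajectories are stored as lists of integer action indices and descriptions as strings (the names, the Atari action-space size, and the normalization of the frequency vector are illustrative assumptions, not taken from the paper's code):

```python
import random
import numpy as np

NUM_ACTIONS = 18  # assumed Atari action-space size

def action_frequency_vector(actions, num_actions=NUM_ACTIONS):
    """Normalized count of each action in a trajectory segment."""
    freq = np.bincount(actions, minlength=num_actions).astype(float)
    return freq / max(len(actions), 1)

def make_pairs(trajectories, descriptions, num_pairs=1000):
    """Build (frequency vector, language, label) examples for LEARN."""
    data = []
    for _ in range(num_pairs):
        i = random.randrange(len(trajectories))
        actions, lang = trajectories[i], descriptions[i]
        t1, t2 = sorted(random.sample(range(len(actions) + 1), 2))  # two distinct timesteps
        f = action_frequency_vector(actions[t1:t2])
        data.append((f, lang, 1))                                   # positive: matching description
        wrong = random.choice([d for d in descriptions if d != lang])
        data.append((f, wrong, 0))                                  # negative (1): mismatched description
        data.append((np.random.dirichlet(np.ones(NUM_ACTIONS)), lang, 0))  # negative (2): random vector
    return data
```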

(figure)

  • The final output of the network is a probability distribution over two classes – RELATED and UNRELATED.
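
One plausible way to realize such a classifier is sketched below: a GRU encodes the language command, an MLP encodes the action-frequency vector, and a small head produces the two class probabilities. The encoder choices and dimensions here are assumptions for illustration and may differ from the paper's actual network.

```python
import torch
import torch.nn as nn

class LEARNClassifier(nn.Module):
    """Scores how RELATED an action-frequency vector is to a language command."""
    def __init__(self, num_actions, vocab_size, embed_dim=50, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lang_enc = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.action_enc = nn.Sequential(nn.Linear(num_actions, hidden_dim), nn.ReLU())
        self.head = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 2),                      # logits for RELATED / UNRELATED
        )

    def forward(self, freq_vec, token_ids):
        _, h = self.lang_enc(self.embed(token_ids))        # final GRU state encodes the command
        features = torch.cat([self.action_enc(freq_vec), h.squeeze(0)], dim=-1)
        return torch.softmax(self.head(features), dim=-1)  # (P_RELATED, P_UNRELATED)
```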

Language-based Rewards

  • Given the sequence of actions $a_1, \dots, a_t$ taken so far and the language instruction $l$ associated with the given MDP+L, create an action-frequency vector $f_t$.
  • Let the output probabilities corresponding to the classes RELATED and UNRELATED be denoted $p_R^t$ and $p_U^t$.
  • Language-based shaping rewards (a computation sketch follows this list):
    • potential function: $\phi(t) = p_R^t - p_U^t$
    • intermediate language-based reward: $R_{\text{lang}}(t) = \gamma\,\phi(t) - \phi(t-1)$
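
A small sketch of how this reward could be computed during an episode, assuming `learn_model` is a callable wrapping the trained LEARN network that returns the two class probabilities, and reusing `action_frequency_vector` from the earlier sketch (names and the discount value are illustrative):

```python
def shaping_reward(actions_so_far, instruction, learn_model, prev_potential, gamma=0.99):
    """Potential-based language reward: gamma * phi(t) - phi(t-1)."""
    f_t = action_frequency_vector(actions_so_far)        # frequencies of actions a_1..a_t
    p_related, p_unrelated = learn_model(f_t, instruction)
    potential = p_related - p_unrelated                  # phi(t)
    return gamma * potential - prev_potential, potential
```

At each environment step, the returned shaping term is added to the extrinsic reward, and the returned potential is carried over as `prev_potential` for the next step.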

(figure)


(figure)

Experiments

  • Experiments are conducted on the Atari game Montezuma’s Revenge, selected because its rich set of objects and interactions allows for a wide variety of natural language descriptions.
  • For each task, a reference trajectory is generated, and Amazon Mechanical Turk is used to obtain three descriptions of that trajectory.

(figure)

Summary

  • The idea of using language descriptions to assist reward shaping is interesting, but the implementation is rather crude.
  • The action-frequency vector is not a good design choice. For example, consider (go left, go down, go right) vs. (go right, go down, go left): in terms of action frequencies the two descriptions look almost identical, yet they produce completely different trajectories.

  • A key claim of the paper is that natural language is easier to obtain than expert knowledge. However, the failure on task 14 in the experiments shows that an imprecise natural language instruction cannot easily be associated with the corresponding actions, and obtaining an instruction as precise as the one for task 6 is no easier than obtaining expert knowledge. (Semantic parsing methods from NLP could be a starting point for addressing this weakness.)

Language Models as Zero-Shot Planners: Extracting Actionable Knowledge for Embodied Agents

(figure)

Background

(figure)

Language Model

LMs are trained to fit a distribution over a text sequence $x = (x_1, \dots, x_n)$ via the chain rule:

$$p(x) = \prod_{i=1}^{n} p\big(x_i \mid x_{<i}\big)$$
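
As a concrete illustration (not part of the paper), the corresponding log-probability can be computed with an off-the-shelf causal LM; the GPT-2 checkpoint below is only a stand-in for the much larger planning LMs used in the paper:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

def sequence_log_prob(text):
    """Sum of log p(x_i | x_<i) over the tokens of `text` (the first token has no conditional)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits                          # (1, n, vocab)
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)   # predictions for tokens 2..n
    token_lp = log_probs.gather(-1, ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_lp.sum().item()
```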

Method

(figure)

Plan Generation

To choose the best action plan among $k$ samples $\{X^1, \dots, X^k\}$, each consisting of tokens $(x^i_1, \dots, x^i_{n_i})$, select the sample with the highest mean log probability:

$$X^{*} = \arg\max_{X^i} \; \frac{1}{n_i} \sum_{j=1}^{n_i} \log p_\theta\big(x^i_j \mid x^i_{<j}\big)$$
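
A sketch of this ranking step, reusing `tokenizer` and `model` from the previous snippet (the sampling parameters, and the assumption that tokenizing `prompt + plan` keeps the prompt tokens as a prefix, are simplifications):

```python
def mean_log_prob(prompt, plan):
    """Mean log-probability of the plan tokens conditioned on the prompt."""
    n_prompt = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + plan, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    token_lp = log_probs.gather(-1, full_ids[:, 1:].unsqueeze(-1)).squeeze(-1)
    return token_lp[:, n_prompt - 1:].mean().item()        # average only over the plan tokens

def best_plan(prompt, num_samples=10):
    """Sample several candidate plans and keep the one with the highest mean log-prob."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    outs = model.generate(ids, do_sample=True, top_p=0.9, max_new_tokens=64,
                          num_return_sequences=num_samples,
                          pad_token_id=tokenizer.eos_token_id)
    plans = [tokenizer.decode(o[ids.shape[1]:], skip_special_tokens=True) for o in outs]
    return max(plans, key=lambda p: mean_log_prob(prompt, p))
```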

Action Translation

Instead of developing a set of rules to transform the free-form text into admissible action steps, for each admissible environment action $a_e$ we calculate its semantic distance to the predicted action phrase $\hat{a}$ by cosine similarity:

$$C\big(f(\hat{a}), f(a_e)\big) = \frac{f(\hat{a}) \cdot f(a_e)}{\lVert f(\hat{a}) \rVert \, \lVert f(a_e) \rVert}$$

where $f$ is an embedding function (the Translation LM).
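
A sketch of this translation step using a Sentence-BERT style encoder as the embedding function $f$ (the paper's Translation LM is a sentence-embedding model of this kind; the specific checkpoint named here is only an example):

```python
from sentence_transformers import SentenceTransformer, util

translation_lm = SentenceTransformer("all-MiniLM-L6-v2")     # example checkpoint

def translate(predicted_phrase, admissible_actions):
    """Map a free-form phrase to the closest admissible environment action."""
    query = translation_lm.encode(predicted_phrase, convert_to_tensor=True)
    candidates = translation_lm.encode(admissible_actions, convert_to_tensor=True)
    sims = util.cos_sim(query, candidates)[0]                # cosine similarity to each action
    return admissible_actions[int(sims.argmax())], float(sims.max())
```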

Trajectory Correction

  • Translating each step on its own does not take the achievability of the individual steps into account.
  • For each sample $\hat{a}$, we consider both its semantic soundness and its achievability in the environment.

We aim to find the admissible environment action $\hat{a}_e$ by modifying the ranking scheme:

$$\hat{a}_e = \arg\max_{a_e,\, \hat{a}} \Big[ C\big(f(\hat{a}), f(a_e)\big) + \beta \cdot \frac{1}{n} \sum_{j=1}^{n} \log p_\theta\big(\hat{x}_j \mid \hat{x}_{<j}\big) \Big]$$

where the second term is the mean log probability of the sampled phrase $\hat{a} = (\hat{x}_1, \dots, \hat{x}_n)$ under the Planning LM and $\beta$ is a weighting coefficient.
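
A sketch of this corrected ranking, building on `translation_lm`, `util`, and `mean_log_prob` from the earlier snippets (the value of `beta` is illustrative):

```python
def correct_step(prompt, sampled_phrases, admissible_actions, beta=0.3):
    """Jointly pick the sampled phrase and admissible action with the best combined score."""
    cand_emb = translation_lm.encode(admissible_actions, convert_to_tensor=True)
    best_score, best_action = float("-inf"), None
    for phrase in sampled_phrases:
        query = translation_lm.encode(phrase, convert_to_tensor=True)
        sims = util.cos_sim(query, cand_emb)[0]
        score = float(sims.max()) + beta * mean_log_prob(prompt, phrase)  # semantic match + LM confidence
        if score > best_score:
            best_score = score
            best_action = admissible_actions[int(sims.argmax())]
    return best_action
```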

Final Method

(figure)

Experiments

(figure)


(figure)

Summary

  • The language model provides strong associative ability: it can make divergent predictions that encourage exploration, and under constraints it can also produce efficient plans for accomplishing the goal.
  • An advantage of using LMs for decision-making is that they can make full use of context information (inspiration: in RL, similarly, we need not be limited to the Markov property; try exploiting more of the history?).

  • In the VirtualHome environment used in the paper's experiments, the actions are very discrete (open the fridge, grab the milk, ...) and the action space is not large, so conventional RL methods might also be able to learn a good policy. Moreover, the experiments only verify that the generated plans are sufficient to complete the goals, without quantitatively evaluating or comparing how good they are.