Topic: Consider the following setting. A is a set of “actions”, Q is a set of “queries”, and N is a set of “annotations”. W is the set of “worlds”, defined as W := [0,1]^A × {g,b,u}^(Q×N). Here, the semantics of the first factor is “mapping from actions to rewards”, and the semantics of the second factor is “mapping from annotated queries (q,n) to {good, bad, ugly}”, where “good” means “the query can be answered”, “bad” means “the query cannot be answered”, and “ugly” means “making this query loses the game”. In addition, we are given a fixed mapping σ: Q → R^W (assigning to each query its semantics, i.e. a real-valued function on worlds). H is a set of “hypotheses”, which is a subset of ΔW (i.e. each hypothesis is a belief about the world).
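As a minimal sketch of these ingredients (a finitely supported instantiation; the names World, Hypothesis and Sigma are illustrative, not part of the problem statement), one could write, e.g. in Python:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Action = str          # element of A
Query = str           # element of Q
Annotation = str      # element of N
Status = str          # "g" (good), "b" (bad), or "u" (ugly)

@dataclass
class World:
    # First factor of W: mapping from actions to rewards in [0, 1].
    rewards: Dict[Action, float]
    # Second factor of W: mapping from annotated queries (q, n) to {g, b, u}.
    status: Dict[Tuple[Query, Annotation], Status]

# A hypothesis (an element of ΔW, here with finite support):
# a list of (world, probability) pairs whose probabilities sum to 1.
Hypothesis = List[Tuple[World, float]]

# σ assigns to each query its semantics: a real-valued function on worlds.
Sigma = Callable[[Query], Callable[[World], float]]
```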
Some hypothesis h∗ ∈ H represents the user’s beliefs, but the agent doesn’t know which one. Instead, it only has a prior ζ ∈ ΔH. On each round, the agent is allowed to either make an annotated query (q,n) ∈ Q×N or take an action from A. Taking an action produces a reward and ends the game. Making a query can either (i) produce a number, namely E_{h∗}[σ(q)] (good), (ii) produce nothing (bad), or (iii) end the game with zero reward (ugly).
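Continuing the sketch above, here is one possible rendering of the interaction protocol. The agent interface (next_move, observe) and the choice to read the statuses and rewards off a single world sampled from h∗ are assumptions made only for concreteness; other modeling choices are compatible with the informal description:

```python
import random

def expected_semantics(h: Hypothesis, sigma: Sigma, q: Query) -> float:
    # E_{h*}[sigma(q)]: the number returned when a query comes back "good".
    return sum(p * sigma(q)(w) for w, p in h)

def run_episode(agent, h_star: Hypothesis, sigma: Sigma) -> float:
    """One game between an agent and a user whose beliefs are h_star."""
    worlds = [w for w, _ in h_star]
    probs = [p for _, p in h_star]
    world = random.choices(worlds, weights=probs)[0]
    while True:
        move = agent.next_move()              # ("query", q, n) or ("act", a)
        if move[0] == "act":
            return world.rewards[move[1]]     # taking an action ends the game
        _, q, n = move
        status = world.status[(q, n)]
        if status == "g":
            agent.observe(expected_semantics(h_star, sigma, q))  # good: answer
        elif status == "b":
            agent.observe(None)               # bad: no answer
        else:
            return 0.0                        # ugly: game ends with zero reward
```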
The problem is to devise algorithms for the agent such that, in expectation w.r.t. h∗ ∼ ζ, the h∗-expected reward approximates the best possible h∗-expected reward (the latter is what the agent would obtain if it knew which hypothesis is correct) while the number of queries stays low. Propose sets of assumptions about the ingredients of the setting that lead to non-trivial bounds. Consider proving both positive results and negative results (the latter meaning: “no algorithm can achieve a bound better than...”).
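One possible way to make this objective precise (the notation Reg(π) and the query-budget function T below are illustrative assumptions, not fixed by the problem statement): for an algorithm π, one could ask for bounds of the form

$$\mathrm{Reg}(\pi) := \mathbb{E}_{h^* \sim \zeta}\!\left[\max_{a \in A} \mathbb{E}_{w \sim h^*}[r_w(a)] - \mathbb{E}^{\pi}_{h^*}[\text{reward}]\right] \le \epsilon, \qquad \mathbb{E}_{h^* \sim \zeta}\!\left[\#\text{queries made by } \pi\right] \le T\big(\epsilon, |A|, |Q \times N|, |H|\big),$$

where r_w(a) denotes the reward component of the world w at action a, and E^π_{h∗}[reward] is the expected reward of π when interacting with a user whose beliefs are h∗. A negative result would then exhibit a family of instances on which no algorithm can satisfy both bounds simultaneously.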
Strategy: See the theoretical research part of my other answer. I advise starting by looking for the minimal simplification of the setting about which it is still possible to prove non-trivial results. In addition, start with bounds that scale with the sizes of the sets in question, then proceed to look for more refined parameters (analogous to the VC dimension in offline learning).
Motivation: Starting the theoretical investigation of dialogic reinforcement learning (DLRL).