A variant of Christiano’s IDA amenable to learning-theoretic analysis. We consider reinforcement learning with a set of observations and a set of actions, where the semantics of the actions is making predictions about future observations. (But, as opposed to vanilla sequence forecasting, making predictions affects the environment.) The reward function is unknown and unobservable, but it is known to satisfy two assumptions:
(i) If we always make the null prediction, the expected utility is lower bounded by some constant.
(ii) If our predictions sample the n-step future for a given policy π, then our expected utility will be lower bounded by some function F(u,n) of the expected utility u of π and of n. F is such that for sufficiently low u, F(u,n)≤u, but for sufficiently high u, F(u,n)>u (in particular, the constant in (i) should be high enough to lead to an increasing sequence). Also, it can be argued that it’s natural to assume F(F(u,n),m)≈F(u,nm) for u≫0.
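To make assumption (ii) concrete, here is a minimal numerical sketch with a hypothetical choice of F (the functional form F(u,n) = u₀ + n^α(u−u₀), the threshold u₀ and the exponent α are my own illustration, not part of the setting): below the threshold F(u,n)≤u, above it F(u,n)>u, and the composition condition F(F(u,n),m)=F(u,nm) holds exactly.

```python
# Toy instance of F (hypothetical choice, for illustration only):
#   F(u, n) = u0 + n**alpha * (u - u0)
# satisfies F(u, n) <= u for u <= u0, F(u, n) > u for u > u0 (when n > 1),
# and composes exactly: F(F(u, n), m) = F(u, n * m).

u0, alpha = 0.5, 0.3  # hypothetical threshold and exponent

def F(u: float, n: int) -> float:
    """Lower bound on expected utility after sampling the n-step future of a policy with utility u."""
    return u0 + n**alpha * (u - u0)

# Starting above the threshold u0 (as assumption (i) requires), iterated
# amplification yields an increasing sequence of utility lower bounds.
u = 0.6
for step in range(5):
    print(f"step {step}: u = {u:.4f}")
    u = F(u, n=2)  # each amplification round samples the 2-step future
```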
The goal is proving regret bounds for this setting. Note that this setting automatically deals with malign hypotheses in the prior, bad self-fulfilling prophecies, and “corrupting” predictions that cause damage merely by being seen.
However, I expect that without additional assumptions the regret bound will be fairly weak, since the penalty for making wrong predictions grows with the total variation distance between the prediction distribution and the true distribution, which is quite harsh. I think this reflects a true weakness of IDA (or some versions of it, at least): without an explicit model of the utility function, we need very high fidelity to guarantee robustness against e.g. malign hypotheses. On the other hand, it might be possible to ameliorate this problem by introducing an additional assumption of the form: the utility function is e.g. Lipschitz w.r.t. some metric d. Then, total variation distance is replaced by the Kantorovich-Rubinstein distance defined w.r.t. d. The question is where we get the metric from. Maybe we can consider something like the process of humans rewriting texts into equivalent texts.
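As a minimal sketch of why the metric assumption could help (the specific distributions below are my own illustrative choice): a prediction that puts all its mass on an outcome adjacent to the true one is maximally wrong in total variation, but only one unit of d away in Kantorovich-Rubinstein distance, so a utility function that is Lipschitz w.r.t. d can only be perturbed slightly.

```python
# Hypothetical example contrasting total variation with Kantorovich-Rubinstein
# (Wasserstein-1) distance. Outcomes are the integers 0..99 with metric d(x, y) = |x - y|.
import numpy as np

true_dist = np.zeros(100); true_dist[50] = 1.0  # true distribution: point mass at 50
pred_dist = np.zeros(100); pred_dist[51] = 1.0  # prediction: point mass at the neighbour 51

# Total variation distance: half the L1 distance between the probability vectors.
tv = 0.5 * np.abs(true_dist - pred_dist).sum()

# Kantorovich-Rubinstein (Wasserstein-1) distance in 1-D with unit spacing:
# the L1 norm of the difference of the CDFs.
w1 = np.abs(np.cumsum(true_dist) - np.cumsum(pred_dist)).sum()

print(f"total variation:        {tv:.3f}")  # 1.000 -- maximal, the prediction looks useless
print(f"Kantorovich-Rubinstein: {w1:.3f}")  # 1.000 -- one unit of d, small relative to the 100-outcome range
```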