Thanks! I like it less now, but I suppose that’s to be expected (I expect I publish posts when I’m most confident in the ideas in them).
I do think it’s aged better than my other (non-public) writing at the time, so at least past-me was calibrated on which of my thoughts were good, at least according to current-me?
The main way in which my thinking differs is that I’m less optimistic about defining things in terms of what optimization is happening; such a definition seems like it would be too vague / fuzzy / filled with edge cases to be useful for AI alignment. I do think the definition could be used to construct formal models that can be analyzed (as has already been done in assistance games / CIRL and the off-switch game).
The definition is also flawed; it clearly can’t be just about optimizing a latent variable, since that’s true of any POMDP. What the agent ends up optimizing for depends entirely on how the latent variable connects to the agent’s observations; this clearly isn’t enough to get “do what we mean”. I think the better version is Stuart’s three principles in Human Compatible (summary).
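To spell out the POMDP point (a minimal sketch in my own notation, not something from the original post): suppose the hidden state includes a reward parameter $\theta$, with reward $R(s_t, a_t; \theta)$, and the agent maintains a belief $b_t(\theta) = P(\theta \mid o_{1:t})$. The optimal policy is roughly

$$\pi^* = \arg\max_\pi \; \mathbb{E}\!\left[\, \sum_t \gamma^t R(s_t, a_t; \theta) \;\middle|\; \theta \sim b_0,\ \pi \right],$$

and the only place the observation model enters is through the belief update $b_{t+1}(\theta) \propto P(o_{t+1} \mid \theta, \text{history})\, b_t(\theta)$. So two agents “optimizing the same latent variable” but with different observation models (e.g. one whose observations include human actions, as in CIRL, versus one that only sees some proxy signal) can end up behaving completely differently; that’s the sense in which “optimizing a latent variable” by itself underdetermines what is actually being optimized.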