Since I’m an author on that paper, I wanted to clarify my position here. My perspective is basically the same as Steven’s: there’s a straightforward conceptual argument that goal-directedness leads to convergent instrumental subgoals, this is an important part of the AI risk argument, and the argument gains much more legitimacy and slightly more confidence in correctness by being formalized in a peer-reviewed paper.
I also think this has basically always been my attitude towards this paper. In particular, I don’t think I ever thought of this paper as providing any evidence about whether realistic trained systems would be goal-directed.
Just to check that I wasn’t falling prey to hindsight bias, I looked through our Slack history. Most of it is about the technical details of the results, so not very informative, but the few higher-level conversations we did have overall support this picture, I think. E.g. here are some quotes (only things I said):
Nov 3, 2019:
I think most formal / theoretical investigation ends up fleshing out a conceptual argument I would have accepted, maybe finding a few edge cases along the way; the value over the conceptual argument is primarily in the edge cases, getting more confidence, and making it easier to argue with
Dec 11, 2019:
my prediction is that agents will behave as though their reward is time-dependent / history-dependent, like humans do
We will deploy agents whose revealed specification / reward if we take the intentional stance towards them are non-Markovian