My sense is that you want to decline to play this game and instead say: just don’t build AI systems that search for high-scoring plans.
That might be OK if it turns out that planning isn’t an effective algorithmic ingredient, or if you can convince people not to build such systems because doing so is dangerous (and similar difficulties don’t arise when agents learn planning internally). But failing that, we are going to have to figure out how to build AI systems that capture the benefits of planning without being dangerous.
(It’s possible you instead have a novel proposal for a way to capture the benefits of search without the risks, in which case I’d withdraw this comment once part 2 comes out, though I wish you’d led with the juicy part.)
I don’t think we can or need to avoid planning per se. My position is more that certain design choices—e.g. optimizing the output of a grader that has the diamond-value, instead of actually having the diamond-value yourself—force you into solving ridiculously hard subproblems, like robustness against adversarial inputs in the exponential-in-planning-horizon plan space.
Just to set expectations: I don’t have a proposal for capturing “the benefits of search without the risks”; if you give value-child bad values, he will kill you. But I do have a proposal for how several apparent challenges (e.g. robustness to adversarial inputs proposed by the actor) are artifacts of design patterns like the ones I outlined in this post. I’ll outline why I think that realistic (e.g. not argmax) cognitive/motivational structures automatically avoid these extreme difficulties.
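To make the contrast concrete, here’s a minimal Python sketch of the two design patterns. All names (`grader_optimizer`, `values_agent`, `propose_plans`, etc.) are my own illustrative inventions, not anything from the post; it’s a toy under the stated assumptions, not a claim about how a real system would be built.

```python
from typing import Callable, List

Plan = str

def grader_optimizer(all_plans: List[Plan], grader: Callable[[Plan], float]) -> Plan:
    # Design pattern 1: argmax over the grader's *output*. The search is
    # adversarial to the grader: any plan that merely *looks* high-scoring
    # (an adversarial input) beats plans that are actually good, so the
    # grader would need to be robust across the whole (exponentially large,
    # in planning horizon) plan space.
    return max(all_plans, key=grader)

def values_agent(propose_plans: Callable[[], List[Plan]],
                 my_values: Callable[[Plan], float]) -> Plan:
    # Design pattern 2: the agent *has* the values itself. It generates a
    # handful of plans from its own policy and keeps the one it values most.
    # Nothing in this loop is searching for inputs that fool `my_values`,
    # so robustness over the full adversarial plan space isn't demanded.
    candidates = propose_plans()  # small, self-generated candidate set
    return max(candidates, key=my_values)

# Toy usage (purely illustrative):
plans = ["synthesize diamonds", "trick the grader into outputting a huge score"]
naive_grader = lambda p: 1e9 if "huge score" in p else 1.0
print(grader_optimizer(plans, naive_grader))  # selects the adversarial plan
```

The point of the toy is only that the adversarial-robustness burden comes from where the argmax is pointed, not from planning itself.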