My perspective is:
Planning against a utility function is an algorithmic strategy that people might use as a component of powerful AI systems. For example, they may generate several plans and pick the best, or use MCTS, or whatever (a minimal sketch of this pattern appears after these points). (People may use this explicitly on the outside, or it may be learned as a cognitive strategy by an agent.)
There are reasons to think that systems using this algorithm would tend to disempower humanity. We would like to figure out how to build similarly powerful AI systems that don’t do that.
We don’t currently have candidate algorithms that can safely substitute for planning. So we need to find an alternative.
Right now the only thing remotely close to working for this purpose is running a very similar planning algorithm but against a utility function that does not incentivize disempowering humanity.
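For concreteness, here is a minimal sketch of that “generate candidate plans, score them with a utility function, pick the best” pattern. The names (Plan, UtilityFn, propose_plans, best_plan) and the toy utility are purely illustrative assumptions of mine; a real system might use MCTS or a learned planner instead, but the shape is the same.

```python
import random
from typing import Callable, List, Sequence

Plan = List[str]                      # here, a plan is just a sequence of action names
UtilityFn = Callable[[Plan], float]   # scores a plan; higher is better

def propose_plans(num_plans: int, actions: Sequence[str], horizon: int) -> List[Plan]:
    """Stand-in proposer: random rollouts (a real system might use MCTS or a learned policy)."""
    return [[random.choice(list(actions)) for _ in range(horizon)] for _ in range(num_plans)]

def best_plan(utility: UtilityFn, candidates: List[Plan]) -> Plan:
    """Planning against a utility function: evaluate every candidate and return the argmax."""
    return max(candidates, key=utility)

# The planner's behavior is entirely downstream of what `utility` rewards.
toy_utility: UtilityFn = lambda plan: plan.count("gather_resources") - 0.1 * plan.count("rest")
candidates = propose_plans(num_plans=1000, actions=["gather_resources", "rest", "explore"], horizon=10)
print(best_plan(toy_utility, candidates))
```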
My sense is that you want to decline to play this game and instead say: just don’t build AI systems that search for high-scoring plans.
That might be OK if it turns out that planning isn’t an effective algorithmic ingredient, or if you can convince people not to build such systems because they are dangerous (and similar difficulties don’t arise if agents learn planning internally). But failing that, we are going to have to figure out how to build AI systems that capture the benefits of planning without being dangerous.
(It’s possible you instead have a novel proposal for a way to capture the benefits of search without the risks, in which case I’d withdraw this comment once part 2 came out, though I wish you’d led with the juicy part.)
As a secondary point (that I’ve said a bunch of times), I also found the arguments in this post uncompelling. Probably the first thing to clarify is that I feel like you equivocate between the grader being something that is embedded in the real world and hence subject to manipulation by real-world consequences of the actor’s actions, and the grader being something that operates on plans in the agent’s head in order to select the best one. In the latter case the grader is still subject to manipulation, but the prospects for manipulation seem unrelated to the open-endedness of the domain and unrelated to taking dangerous actions.
Probably the first thing to clarify is that I feel like you equivocate between the grader being something that is embedded in the real world and hence subject to manipulation by real-world consequences of the actor’s actions, and the grader being something that operates on plans in the agent’s head in order to select the best one. In the latter case the grader is still subject to manipulation, but the prospects for manipulation seem unrelated to the open-endedness of the domain and unrelated to taking dangerous actions.
This seems like a misunderstanding. While I’ve previously communicated to you arguments about problems with manipulating embedded grading functions, that is not at all what this post is intended to be about. I’ll edit the post to make the intended reading more obvious. None of this post’s arguments rely on the grader being embedded and therefore physically manipulable. As I wrote in footnote 1:
I’m not assuming the actor wants to maximize the literal physical output of the grader, but rather just the “spirit” of the grader. More formally, the actor is trying to argmax_{plan p} Grader(p), where Grader can be defined over the agent’s internal plan ontology.
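To make the footnote’s distinction concrete, here is a small sketch in which the grader is a function over the agent’s internal plan representations rather than a physical evaluator sitting in the world; the names (Plan, Grader, choose_plan) are mine for illustration. On this reading, “manipulating the grader” means finding a plan whose internal representation gets an inflated score, with no physical tampering involved.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass(frozen=True)
class Plan:
    """A plan in the agent's internal ontology, not a physical object out in the world."""
    steps: Tuple[str, ...]  # e.g. predicted actions/consequences, as the agent represents them

Grader = Callable[[Plan], float]  # defined over internal plans, not over world states

def choose_plan(candidates: List[Plan], grader: Grader) -> Plan:
    """The actor argmaxes the grader over its own plan representations."""
    return max(candidates, key=grader)

# The failure mode at issue: if `grader` assigns a spuriously high score to some unusual
# plan representation, choose_plan selects exactly that plan; no physical tampering with
# an embedded grader is needed.
```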
Anyways, replying in particular to:
the prospects for manipulation seem unrelated to the open-endedness of the domain and unrelated to taking dangerous actions.
Open-ended domains are harder to grade robustly on all inputs because more stuff can happen, and the plan space grows exponentially with the planning horizon, since the branching factor is the number of available actions. E.g., it’s probably far harder to produce a Dota 2 game state that emotionally manipulates the grader (one I look at and feel compelled to assign a ridiculously high number) than to produce a manipulative state in the real world, which can play to the grader’s particular insecurities and desires, perhaps reminding them of triggering events from their past in order to make their judgments higher-variance.
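As a rough quantitative illustration (a toy simulation I’m adding here, not an experiment from the post): if the grader’s score is the true value plus some evaluation error, the argmax-selected plan tends to be one whose score is inflated by error, and the inflation grows with the number of candidate plans, which itself scales like branching_factor ** horizon.

```python
import numpy as np

rng = np.random.default_rng(0)

def grader_inflation(num_plans: int, trials: int = 200) -> float:
    """Average (grader score - true value) of the argmax-selected plan.

    Toy model: true_value ~ N(0, 1); grader_score = true_value + error, error ~ N(0, 1).
    """
    true_value = rng.normal(size=(trials, num_plans))
    grader_score = true_value + rng.normal(size=(trials, num_plans))
    picked = np.argmax(grader_score, axis=1)
    rows = np.arange(trials)
    return float(np.mean(grader_score[rows, picked] - true_value[rows, picked]))

# Plan-space size scales like branching_factor ** horizon, so open-ended domains
# (large branching factor) give the actor far more chances to stumble onto a plan
# whose high score comes from grader error rather than from true value.
for branching, horizon in [(3, 4), (10, 4)]:
    num_plans = branching ** horizon  # 81 vs 10,000 candidate plans
    print(num_plans, round(grader_inflation(num_plans), 2))
```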
My sense is that you want to decline to play this game and instead say: just don’t build AI systems that search for high-scoring plans.
That might be OK if it turns out that planning isn’t an effective algorithmic ingredient, or if you can convince people not to build such systems because they are dangerous (and similar difficulties don’t arise if agents learn planning internally). But failing that, we are going to have to figure out how to build AI systems that capture the benefits of planning without being dangerous.
(It’s possible you instead have a novel proposal for a way to capture the benefits of search without the risks, in which case I’d withdraw this comment once part 2 came out, though I wish you’d led with the juicy part.)
I don’t think we can or need to avoid planning per se. My position is more that certain design choices—e.g. optimizing the output of a grader with a diamond-value, instead of actually having the diamond-value yourself—force you to solve ridiculously hard subproblems, like robustness against adversarial inputs in the exponential-in-planning-horizon plan space.
Just to set expectations, I don’t have a proposal for capturing “the benefits of search without the risks”; if you give value-child bad values, he will kill you. But I have a proposal for how several apparent challenges (e.g. robustness to adversarial inputs proposed by the actor) are artifacts of e.g. the design patterns I outlined in this post. I’ll outline why I think that realistic (e.g. not argmax) cognition/motivational structures automatically avoid these extreme difficulties.
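To gesture at why strict argmax specifically is doing so much work in the difficulty above (a toy numerical illustration of my own, not a preview of the author’s part 2): under the same noisy-grader model as before, taking the single argmax plan inflates the grader’s score far more than a milder selection rule such as sampling uniformly from the top 10% of plans by score. The names (inflation, argmax_pick, top_decile_pick) are made up for this sketch.

```python
import numpy as np

rng = np.random.default_rng(1)

def inflation(select, num_plans: int = 10_000, trials: int = 200) -> float:
    """Average (grader score - true value) of the plan chosen by `select`."""
    true_value = rng.normal(size=(trials, num_plans))
    score = true_value + rng.normal(size=(trials, num_plans))
    idx = select(score)  # one chosen plan index per trial
    rows = np.arange(trials)
    return float(np.mean(score[rows, idx] - true_value[rows, idx]))

def argmax_pick(score):
    """Strict argmax over grader scores."""
    return np.argmax(score, axis=1)

def top_decile_pick(score):
    """Sample a random plan from the top 10% by grader score (a milder selection rule)."""
    k = score.shape[1] // 10
    top = np.argpartition(score, -k, axis=1)[:, -k:]  # column indices of the top k per trial
    choice = rng.integers(0, k, size=score.shape[0])
    return top[np.arange(score.shape[0]), choice]

print("argmax inflation:    ", round(inflation(argmax_pick), 2))
print("top-decile inflation:", round(inflation(top_decile_pick), 2))
```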