An exercise that helped me see the “argmax is a trap” point better was to concretely imagine what the cognitive stacktrace for an agent running argmax search over plans might look like:
```
# Traceback (most recent call last)
At line 42 in main:
    decision = agent.deliberate()
At line 3 in deliberate:
    all_plans = list(plan_generator)
    return argmax(all_plans, grader=self.grader)
At line 8 in argmax:
    for plan in plans:  # includes adversarial inputs (!!)
        evaluation = apply(grader, plan)
At line 264 in apply:
    predicted_object_level_consequences = ...
KeyboardInterrupt
```
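(To make the failure mode concrete, here's a toy, purely illustrative sketch of the kind of deliberate/argmax loop the traceback is gesturing at. The `Plan` fields and the numbers are made up for illustration, not anything from TurnTrout's setup; the point is just that taking the max over grader evaluations happily returns the plan whose only virtue is fooling the grader.)

```python
# A minimal, hypothetical sketch of grader-fooling argmax.
# Plan, its fields, and the toy payoffs are all invented for illustration.
from dataclasses import dataclass

@dataclass
class Plan:
    description: str
    true_value: float       # what actually happens if the plan is executed
    grader_estimate: float  # what the grader *predicts* will happen

def argmax_plan(plans, grader):
    """Return the plan with the highest grader evaluation --
    including adversarial plans whose only merit is fooling the grader."""
    return max(plans, key=grader)

plans = [
    Plan("do the ordinary good thing", true_value=10, grader_estimate=10),
    Plan("exploit a quirk in the grader", true_value=-100, grader_estimate=10**6),
]

chosen = argmax_plan(plans, grader=lambda p: p.grader_estimate)
print(chosen.description)  # -> "exploit a quirk in the grader"
```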
A major problem with this design is that the agent considers “What do I think would happen if I ran plan X?”, but NOT “What do I think would happen if I generated plans using method Y?”. If the agent were considering the second question as well, it would (rightly) conclude that the method “generate all plans & run argmax search on them” would spit out a grader-fooling adversarial input, which would cause it to implement some arbitrary plan with low expected value. (Heck, with future LLMs maybe you’ll be able to just show them an article about the Optimizer’s Curse and they’ll grok this.) Knowing this, the agent tosses aside this foreseeably-harmful-to-its-interests search method.
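(And here's an equally toy sketch of what asking the second question could look like: the agent models the predicted consequences of *running each search method* and discards the unrestricted argmax before any adversarial plan ever reaches the grader. The method names and payoff numbers are invented for illustration; this isn't a proposal, just the reflective step made explicit.)

```python
# Hypothetical sketch: evaluating search *methods* by their predicted
# object-level consequences, rather than only evaluating individual plans.
# The payoffs below are toy stand-ins for the agent's self-model.

def predict_method_outcome(method: str) -> float:
    """Agent's prediction of what actually happens if it searches this way."""
    if method == "argmax over every plan I can generate":
        return -100.0  # predicts the winner will be a grader-fooling plan
    if method == "search over a small set of vetted plans":
        return 10.0
    return 0.0

candidate_methods = [
    "argmax over every plan I can generate",
    "search over a small set of vetted plans",
]

chosen_method = max(candidate_methods, key=predict_method_outcome)
print(chosen_method)  # -> the vetted-plan search, not the unrestricted argmax
```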
The natural next question is “Ok, but what other option(s) do we have?”. I’m guessing TurnTrout’s next post might look into that, and he likely has more worked-out thoughts on it, so I’ll leave it to him. But I’ll just say that I don’t think coherent real-world agents will/should/need to be running cognitive stacktraces with footguns like this.