Davidmanheim comments on Don’t design agents which exploit adversarial inputs

Davidmanheim 20 Nov 2022 7:29 UTC
This relates closely to the question of how to “solve” Goodhart problems in general. Using multiple metrics or graders makes exploitation more complex, but has other drawbacks. I discussed the different approaches in my paper here, albeit in the realm of social dynamics rather than AI safety.
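To make the multiple-graders point concrete, here is a minimal toy sketch. Everything in it is an assumption for illustration: the two graders (`length_grader`, `keyword_grader`), their flaws, and the choice of `min` as the aggregation rule are hypothetical and not taken from the paper. It only shows that an input exploiting one grader's flaw no longer scores well, while an input exploiting both flaws at once still does.

```python
from typing import Callable, Sequence

Grader = Callable[[str], float]


def length_grader(answer: str) -> float:
    """Hypothetical grader: rewards longer answers, so padding exploits it."""
    return min(len(answer.split()) / 20.0, 1.0)


def keyword_grader(answer: str) -> float:
    """Hypothetical grader: rewards key terms, so keyword stuffing exploits it."""
    keywords = {"goodhart", "metric", "proxy"}
    words = set(answer.lower().split())
    return len(keywords & words) / len(keywords)


def combined_score(answer: str, graders: Sequence[Grader]) -> float:
    """Aggregate with min (an assumed rule): every grader must be satisfied."""
    return min(g(answer) for g in graders)


graders = [length_grader, keyword_grader]

padded = "filler " * 20                             # exploits only length_grader
stuffed = "goodhart metric proxy"                   # exploits only keyword_grader
both = "goodhart metric proxy " + "filler " * 17    # exploits both at once

for name, text in [("padded", padded), ("stuffed", stuffed), ("both", both)]:
    print(name, combined_score(text, graders))
```

Neither single-flaw exploit scores well under the combined grader, but the third input does, so in this toy case exploitation becomes more complex rather than impossible.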