This sounds like a reinvention of quantilization, and yes, that’s a thing you can do to improve safety, but (1) you still need your prior over plans to come from somewhere (perhaps you start out with something IRL-like and then update it based on experience of what worked, which brings you back to square one), and (2) it just gives you a safety-capabilities tradeoff dial rather than actually solving safety.
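For concreteness, here’s a minimal sketch of the kind of quantilizer I mean, assuming a discrete set of plans with a base prior and a utility estimate (the function names, the numpy implementation, and the convention that `q` is a fraction of prior mass are all just illustrative choices):

```python
import numpy as np

def quantilize(plans, prior_probs, utility, q=0.1, rng=None):
    """Sample a plan from the top-q fraction of the prior, ranked by utility.

    Instead of argmax-ing the utility estimate (which invites Goodharted
    plans), a quantilizer keeps only the plans making up the top q of prior
    mass when sorted by utility, then samples from the prior within that set.
    """
    rng = np.random.default_rng() if rng is None else rng
    utilities = np.array([utility(p) for p in plans])
    prior_probs = np.asarray(prior_probs, dtype=float)

    # Rank plans by utility (descending) and keep the top-q prior mass.
    order = np.argsort(-utilities)
    cum_mass = np.cumsum(prior_probs[order])
    keep = order[cum_mass <= q]
    if len(keep) == 0:          # ensure at least the single best plan survives
        keep = order[:1]

    # Renormalize the prior over the kept plans and sample from it.
    kept_probs = prior_probs[keep] / prior_probs[keep].sum()
    return plans[rng.choice(keep, p=kept_probs)]
```

The dial is visible right there: shrinking `q` toward 0 recovers pure argmax over the utility estimate, while pushing it toward 1 just reproduces the prior, so `q` trades capability against safety rather than fixing anything.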
Or hmm...
If you do basic reinforcement based on experience, then that’s an unbounded adversarial search, but it’s really slow and therefore might be safe. And it also raises the question of whether there are other safer approaches.
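As a sketch of what “basic reinforcement based on experience” does to that prior, assuming the same discrete plan setup (again illustrative only, an exponentiated-weights-style update rather than anyone’s specific proposal):

```python
import numpy as np

def reinforce_prior(plans, prior_probs, reward, rounds=1000, lr=0.1, rng=None):
    """Crude reinforcement of the plan prior: reweight toward plans that worked.

    Each round samples a plan from the current distribution, observes a
    reward, and multiplicatively upweights that plan. Over many rounds the
    distribution concentrates on whatever scores highest.
    """
    rng = np.random.default_rng() if rng is None else rng
    probs = np.asarray(prior_probs, dtype=float).copy()
    for _ in range(rounds):
        i = rng.choice(len(plans), p=probs / probs.sum())
        r = reward(plans[i])            # observed outcome of trying the plan
        probs[i] *= np.exp(lr * r)      # upweight plans that worked
    return probs / probs.sum()
```

The limit of this update is still whatever plan maximizes reward, i.e. the same unbounded adversarial search; the only safety margin is that it gets there at the pace experience comes in.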