Which of the standard alignment arguments do you think no longer hold up if we replace argmax with softmax?
The specific argument that you just referenced in your earlier comment: that argmax is important for competitiveness, but that argmax is inherently unsafe because of adversarial optimization (“argmax is a trap”).
The first one that comes to my mind is: suppose we live in a world where intelligence explosion is possible, and someone builds an AI with a flawed utility function, […]
If you assume you've already completely failed, then the how/why is less interesting.
The argmax argument, expounded further, is that any slight imperfection in the utility function results in doom, because adversarial optimization magnifies that slight imperfection as you extend the planning horizon into the far future and improve planning/modeling precision.
But that isn't actually how it works. Instead, due to compounding planning uncertainty, far-future value distributions are high-variance, and you get convergence to empowerment, as I mentioned in the linked discussion.
But that's good news for alignment: small mis-specifications in the utility function model converge away rather than diverging to infinity, because the planning trajectory converges to empowerment regardless of the utility function.
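To make the "slight imperfection magnified by optimization" dynamic concrete, here is a minimal toy sketch in Python/NumPy (the Gaussian error model, the 0.1 error scale, and the candidate-plan counts are arbitrary assumptions for illustration, not anything from this exchange): when a planner argmaxes over a proxy utility that differs only slightly from the true utility, the plan it selects is increasingly one whose apparent value comes from the error as the candidate pool, i.e. the optimization pressure, grows.

```python
# Toy sketch: argmax against a slightly mis-specified utility ("proxy") increasingly
# selects plans whose high score is due to the mis-specification itself.
import numpy as np

rng = np.random.default_rng(0)

def selection_gap(n_plans: int, error_std: float = 0.1) -> float:
    """Return (proxy value - true value) of the plan chosen by argmax on the proxy."""
    true_value = rng.normal(size=n_plans)                                 # true utility of each candidate plan
    proxy_value = true_value + rng.normal(scale=error_std, size=n_plans)  # slightly flawed utility model
    best = np.argmax(proxy_value)                                         # argmax planning over the proxy
    return proxy_value[best] - true_value[best]

# More candidate plans ~ stronger optimization pressure: the chosen plan is
# increasingly one the proxy *over*-estimates (the optimizer's curse).
for n in (10, 1_000, 100_000):
    gaps = [selection_gap(n) for _ in range(200)]
    print(f"{n:>7} candidate plans: mean over-estimate of the chosen plan = {np.mean(gaps):.3f}")
```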
The specific argument that you just referenced in your earlier comment: that argmax is important for competitiveness, but that argmax is inherently unsafe because of adversarial optimization (“argmax is a trap”).
Assuming softmax is important for competitiveness instead, I don’t see why this argument doesn’t go through with “argmax” replaced by “softmax” throughout (including the “argmax is a trap” section of the OP). I read your linked comment and post, and still don’t understand. I wonder what the authors of the OP (or anyone else) think about this.
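As a minimal sketch of what "replacing argmax with softmax" means operationally (Python/NumPy; the three utility values and the temperature schedule are arbitrary assumptions for illustration): softmax selection samples each plan with probability proportional to exp(utility / T), and as the temperature T shrinks it concentrates on the same plan that argmax would pick.

```python
# Toy sketch: softmax (Boltzmann) plan selection versus argmax, as temperature varies.
import numpy as np

utilities = np.array([1.0, 1.2, 3.0])    # made-up utilities for three candidate plans

def softmax_probs(u: np.ndarray, temperature: float) -> np.ndarray:
    """Probability of selecting each plan under softmax at the given temperature."""
    z = (u - u.max()) / temperature       # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

for t in (10.0, 1.0, 0.1, 0.01):
    print(f"T={t:>5}: selection probabilities = {np.round(softmax_probs(utilities, t), 3)}")
# At high T the selection is close to uniform; as T -> 0 essentially all of the
# probability mass lands on the plan argmax would choose.
```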