Another reason to think about argmax in relation to AI safety/alignment is that if you design an AI that doesn’t argmax (or do its best to approximate argmax), …
Actual useful AGI will not be built from argmax, because argmax isn’t useful for efficient approximate planning. You have exponentially growing (in time) uncertainty from computational approximation and from fundamental physics, which translates into uncertainty over future state value estimates; if you argmax over estimates that noisy, you are mostly just selecting for noise. The correct ways of handling that uncertainty lead to something more like softmax or soft actor-critic, which avoid these issues (and also naturally lead to empowerment as an emergent heuristic).
So argmax is only useful in toy problem domains and mostly worthless for real-world planning. To the extent that much of the standard alignment argumentation now rests on this misunderstanding, those arguments are ill-founded.
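As a minimal sketch of the “selecting for noise” claim (with made-up numbers, nothing from the comment itself): when estimation noise dominates the true value differences, the plan that argmax picks is largely the one whose estimate got the luckiest error, so its estimated value systematically overstates its true value; sampling from a softmax at a temperature matched to the noise scale shrinks that selection bias.

```python
import numpy as np

rng = np.random.default_rng(0)
K, trials = 50, 5000       # candidate plans per decision, Monte Carlo repeats
noise = 1.0                # std of the value-estimation error (dominates the true spread)

argmax_gap, softmax_gap = [], []
for _ in range(trials):
    true_v = rng.normal(0.0, 0.1, size=K)             # nearly indistinguishable true values
    est_v = true_v + rng.normal(0.0, noise, size=K)   # noisy value estimates

    # argmax: commit to the single largest (noisy) estimate
    a = int(np.argmax(est_v))

    # softmax: sample plans in proportion to exp(estimate / temperature),
    # with the temperature set to the scale of the estimation noise
    logits = est_v / noise
    p = np.exp(logits - logits.max())
    p /= p.sum()
    s = rng.choice(K, p=p)

    # how much the chosen plan's estimate overstates its true value
    argmax_gap.append(est_v[a] - true_v[a])
    softmax_gap.append(est_v[s] - true_v[s])

print("argmax  overestimates its chosen plan by", round(float(np.mean(argmax_gap)), 2))
print("softmax overestimates its chosen plan by", round(float(np.mean(softmax_gap)), 2))
```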
Which of the standard alignment arguments do you think no longer hold up if we replace argmax with softmax?
The first one that comes to my mind is: suppose we live in a world where an intelligence explosion is possible, and someone builds an AI with a flawed utility function. It would quickly become superintelligent and ignore orders to shut down, because shutting down has lower expected utility than not shutting down. It seems to me that replacing the argmax in the AI’s decision procedure with softmax results in the same outcome, since the AI’s estimated expected utility of not shutting down would be vastly greater than that of shutting down, resulting in a softmax probability of nearly 1 for that option.
Am I misunderstanding something in the paragraph above, or do you have other arguments in mind?
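A quick numeric check of the reasoning in the paragraph above, using made-up expected utilities: unless the softmax temperature is on the order of the utility gap itself, the resulting choice probabilities are effectively indistinguishable from argmax.

```python
import numpy as np

def softmax(utilities, temperature=1.0):
    """Boltzmann choice probabilities over estimated expected utilities."""
    z = np.asarray(utilities, dtype=float) / temperature
    z -= z.max()                      # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

# Hypothetical expected utilities (illustrative numbers): keep running vs. shut down.
u = [1_000_000.0, 0.0]

for T in (1.0, 100.0, 10_000.0):
    p_run, p_shutdown = softmax(u, temperature=T)
    print(f"temperature {T:>8}: P(ignore the shutdown order) = {p_run:.6f}")
```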
Which of the standard alignment arguments do you think no longer hold up if we replace argmax with softmax?
The specific argument that you just referenced in your earlier comment: that argmax is important for competitiveness, but that argmax is inherently unsafe because of adversarial optimization (“argmax is a trap”).
The first one that comes to my mind is: suppose we live in a world where an intelligence explosion is possible, and someone builds an AI with a flawed utility function.
If you assume you’ve already completely failed, then the how/why is less interesting.
The argmax argument, expounded further, is that any slight imperfection in the utility function results in doom, because adversarial optimization magnifies that slight imperfection as you extend the planning horizon into the far future and improve planning/modeling precision.
But that isn’t actually how it works. Instead, due to compounding planning uncertainty, far-future value distributions are high variance, and you get convergence to empowerment, as I mentioned in the linked discussion.
And that’s good news for alignment: small mis-specifications in the utility function model converge away rather than diverging to infinity, because the planning trajectory converges to empowerment regardless of the utility function.
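A toy illustration of the compounding-uncertainty claim (assumed error rates; it says nothing about the convergence-to-empowerment part): if each planning step contributes a small independent relative error to the value estimate, the spread of far-future value estimates grows with the horizon, swamping small differences in the specified utility.

```python
import numpy as np

rng = np.random.default_rng(0)
per_step_error = 0.1        # assumed 10% relative modeling/approximation error per step
trials = 20_000

for horizon in (1, 5, 10, 20, 40, 80):
    # Each trajectory's value estimate picks up an independent relative error at
    # every planning step, so the total multiplicative error compounds with horizon.
    step_factors = 1.0 + rng.normal(0.0, per_step_error, size=(trials, horizon))
    compounded = step_factors.prod(axis=1)
    print(f"horizon {horizon:>3}: std of compounded value-estimate error = {compounded.std():.2f}")
```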
The specific argument that you just referenced in your earlier comment: that argmax is important for competitiveness, but that argmax is inherently unsafe because of adversarial optimization (“argmax is a trap”).
Assuming softmax is important for competitiveness instead, I don’t see why this argument doesn’t go through with “argmax” replaced by “softmax” throughout (including the “argmax is a trap” section of the OP). I read your linked comment and post, and still don’t understand. I wonder what the authors of the OP (or anyone else) think about this.