I think POST is a simple and natural rule for AIs to learn. Any kind of capable agent will have some way of comparing outcomes, and one feature of outcomes that capable agents will represent is ‘time that I remain operational’.
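To make that concrete: POST isn't spelled out in this excerpt, so on the (possibly wrong) reading that it restricts preference comparisons to outcomes with the same remaining operational time, a minimal sketch of "comparing outcomes, with operational time as one represented feature" might look like the toy code below. All names are hypothetical.

```python
# Hypothetical illustration only: this assumes POST amounts to "only compare
# outcomes with equal remaining operational time", which is my reading, not
# something stated in the comment above.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Outcome:
    reward: float           # whatever else the agent cares about
    time_operational: int   # the feature highlighted above

def post_prefers(a: Outcome, b: Outcome) -> Optional[bool]:
    """True/False if a is/isn't preferred to b; None means no preference."""
    if a.time_operational != b.time_operational:
        return None         # different lifetimes: the rule stays silent
    return a.reward > b.reward

print(post_prefers(Outcome(1.0, 10), Outcome(0.5, 10)))  # True
print(post_prefers(Outcome(1.0, 10), Outcome(0.5, 99)))  # None
```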
Do you think selectively breeding humans for this would result in this rule generalizing? (You can tell them that they should follow this rule if you want. But, if you do this, you should also consider whether "telling them they should be obedient and then breeding for this" would also work.)
Do you think it’s natural to generalize to extremely unlikely conditionals that you’ve literally never been trained on (because they are sufficiently unlikely that they would never happen)?
I don’t think human selective breeding (HSB) tells us much about what’s simple and natural for AIs. HSB seems very different from AI training. I’m reminded of the Quintin Pope point that evolution selects genes that build brains that learn parameter values, rather than selecting for parameter values directly. It’s probably hard to get next-token predictors via HSB, but you can do it via AI training.
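To gesture at the Quintin Pope point with a toy sketch (my framing, not his): the outer loop selects the learning setup, and parameter values are only ever produced by within-lifetime learning, never selected directly. Everything below is an arbitrary illustration.

```python
# Toy illustration of "select genes that build brains that learn parameters,
# rather than selecting parameter values directly". Numbers are arbitrary.

def lifetime_learning(learning_rate: float, steps: int = 100) -> float:
    """Inner loop: the parameter value is *learned*, given the 'genes'."""
    w = 0.0
    for _ in range(steps):
        w -= learning_rate * (w - 3.0)   # pull w toward a target of 3.0
    return w

def evolution_like_selection(candidate_rates):
    """Outer loop: selection acts on the learning rule, not on w itself."""
    return min(candidate_rates, key=lambda lr: abs(lifetime_learning(lr) - 3.0))

best = evolution_like_selection([0.001, 0.01, 0.1])
print(best, lifetime_learning(best))
```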
On generalizing to extremely unlikely conditionals, I think TD-agents are in much the same position as other kinds of agents, like expected utility maximizers. Strictly, both have to consider extremely unlikely conditionals to select actions. In practice, both can approximate the results of this process using heuristics.
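As a purely illustrative rendering of that claim: an expected-utility calculation formally sums over every conditional, however unlikely, while a heuristic version prunes states below some probability threshold and in practice usually selects the same action. The names and numbers below are made up.

```python
# Illustrative only: exact expected utility vs. a pruned, heuristic approximation.

def exact_eu(action, prob, utility):
    # Exhaustive sum: includes conditionals that essentially never occur.
    return sum(p * utility(action, state) for state, p in prob.items())

def heuristic_eu(action, prob, utility, threshold=1e-6):
    # Approximation: ignore states below the probability threshold.
    return sum(p * utility(action, state)
               for state, p in prob.items() if p >= threshold)

prob = {"typical": 0.999, "rare": 1e-3 - 1e-9, "never-seen-in-training": 1e-9}
utility = lambda action, state: {"wait": 0.5, "act": 1.0}[action] * (state != "never-seen-in-training")

actions = ["wait", "act"]
print(max(actions, key=lambda a: exact_eu(a, prob, utility)))      # "act"
print(max(actions, key=lambda a: heuristic_eu(a, prob, utility)))  # "act"
```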
I was asking about HSB not because I think it is similar to the process that produces AIs, but because if the answer differs, then it implies you’re making some narrower assumption about the inductive biases of AI training.
> On generalizing to extremely unlikely conditionals, I think TD-agents are in much the same position as other kinds of agents, like expected utility maximizers. Strictly, both have to consider extremely unlikely conditionals to select actions. In practice, both can approximate the results of this process using heuristics.
Sure, from a capabilities perspective. But the question is how the motivations/internal objectives generalize. I agree that an AI trained to be a TD-agent might generalize in that way, for the same reason that an AI trained on a paperclip-maximization objective might generalize to maximizing paperclips in some very different circumstance. But I don’t necessarily buy that this is how the paperclip-maximization-trained AI will generalize!
(I’m picking up this thread from 7 months ago, so I might be forgetting some important details.)