To do something really useful (like nanotech or biological immortality), your model should be something like AlphaZero: a model-based score-maximizer. Because this model is really intelligent, it can model future world states and find that if it is turned off, the future will have a lower score than if it weren't turned off.
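Here is a minimal toy sketch of that pressure (nothing to do with AlphaZero's actual code; the rollouts and numbers are made up): a planner that scores imagined futures will rank "keep running" above "get shut down" whenever shutdown cuts off further score.

```python
# Toy sketch, not AlphaZero: a score-maximizer compares two imagined
# futures using its world model's predicted per-step scores.
def expected_score(imagined_future):
    """Sum of predicted scores over an imagined rollout."""
    return sum(imagined_future)

# Made-up rollouts: after shutdown the agent predicts no further score;
# if it keeps running it predicts continued score accumulation.
comply_future = [1.0, 1.0, 0.0, 0.0, 0.0]   # shut down at step 3
resist_future = [1.0, 1.0, 1.0, 1.0, 1.0]   # never shut down

# A pure score-maximizer planning over these rollouts prefers to resist.
best = max([comply_future, resist_future], key=expected_score)
assert best is resist_future
```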
And yet, AlphaZero is corrigible. Its goal is not even to win; its goal is to play in a way that maximises the chance of winning if the game is played to completion. It does not actually care whether the game is completed or not. For example, it does not trick the player into playing the game to the end by pretending they have a chance of winning.
Though, if it were trained on games against real people, and got a better reward for winning than for games abandoned by players, its value function would probably change to aim for the actual “official” win.
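To make the contrast concrete, here's a toy calculation (the probabilities and the reward scheme are invented for illustration): under the self-play-style objective, abandoned games don't matter; under a "win beats abandonment" reward, the agent gains by keeping the human at the board.

```python
# Toy numbers, purely illustrative.
p_win_if_completed = 0.6   # estimated chance of winning if the game is played out
p_abandoned        = 0.5   # chance the human walks away before the end

# Objective 1 (self-play style): value of a position is the chance of
# winning *given the game is played to completion*; abandonment is irrelevant.
value_conditioned_on_completion = p_win_if_completed

# Objective 2 (hypothetical training on games with people): win = 1,
# loss = 0, abandoned game = 0. Expected reward now drops with abandonment,
# so lowering p_abandoned (e.g. by keeping the player hopeful) raises value.
value_over_whole_episodes = (1 - p_abandoned) * p_win_if_completed

print(value_conditioned_on_completion)  # 0.6
print(value_over_whole_episodes)        # 0.3
```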
Corrigibility is a feature of advanced agency; it does not apply to agents that are not advanced enough. There is nothing unusual about turning off your computer, because your computer is not an advanced agent that can resist being turned off, so there is no reason to say that your computer is “corrigible”.
Yeah so this seems like what I was missing.
But it seems to me that in these types of models, where the utility function is based on the state of the world rather than on the input to the AI, aligning the AI not to kill humanity is easier. Like, if an AI gets a reward every time it sees a paperclip, then it seems hard to punish the AI for killing humans, because “a human dies” is a hard thing for an AI with only sensory input to explicitly recognize. If, however, the AI is trained on a bunch of runs where the utility function is the number of paperclips actually created, then we can also penalize the model for the number of people who actually die.
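A toy sketch of the two reward designs I have in mind (the dictionary fields and the penalty weight are made up): the sensory reward can only react to what shows up in the observation, while the state-based reward is computed from the actual training-world state, so it can count real deaths directly.

```python
# Toy sketch; the fields and the penalty weight are hypothetical.
def sensory_reward(observation):
    # Reward for paperclips the sensors happen to see; a death only matters
    # if the perception system explicitly recognizes it in the input.
    return observation["paperclips_in_view"]

def state_based_reward(world_state, death_penalty=1000.0):
    # Reward computed from the actual (simulated) world state: paperclips
    # really created, minus a penalty per person who really died.
    return world_state["paperclips_created"] - death_penalty * world_state["people_died"]

print(sensory_reward({"paperclips_in_view": 12}))                        # 12
print(state_based_reward({"paperclips_created": 50, "people_died": 2}))  # -1950.0
```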
I’m not very familiar with these forms of training, so I could be off here.