I’ve been thinking about some similar things from a different angle, and I’m enjoying seeing your different take on related ideas.
I’d like to hear more of what you have to say on the subject of U to U’ towards the (possibly non-existent or not reachable U*).
For simplification purposes, maybe just imagine this is taking place in a well-secured sandbox, and the model is interacting with a fake operator in a simulated world. The researchers are observing without themselves interacting.
How might we tell if the model was successfully moving towards better aligned?
How could we judge U against U’?
In what ways does the model in this simplified contained scenario implement Do-What-I-Mean (aka DWIM) in respect to the simulated human?
How does your idea differ from that?
Are the differences necessary or would DWIM be sufficient?
How could you be sure that the model’s pursuit of fulfilling human values or the model’s pursuit of U* didn’t overbalance the instruction to remain shutdown-able?
Wouldn’t persistently pursuing any goal at all make avoiding being shutdown seem good?
I’m not saying I have good answers to these things, I’m not quizzing you. I’m just curious to hear what you think about them.
How might we tell if the model was successfully moving towards better aligned?
An obvious first step: to the extent that the model’s alignment doesn’t already contain an optimized extraction of “What choices would humans make if they had the same purposes/goals but more knowledge, mental capacity, time to think, and fewer cognitive biases?” from all the exabytes of data humans have collected, it should be attempting to gather that data and improve its training.
How could we judge U against U’?
Approximate Bayesian reasoning + Occam’s razor, a.k.a. approximate Solomonoff induction, which forms most of the scientific method. Learning theory shows that both training ML models and LLMs’ in-context learning approximate Solomonoff induction. Beyond Solomonoff induction, the scientific method also adds designing and performing experiments, i.e. careful selection of ways to generate good training data that will distinguish between competing hypotheses. ML practitioners do often try to select the most valuable training data, so we’d need the AI to learn how to do that: there are plenty of school and college textbooks that discuss the scientific method and research techniques, both in general and for specific scientific disciplines, so it’s pretty clear what would need to be in the training set for this skill.
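To make the "Bayesian reasoning + Occam's razor" idea a bit more concrete, here is a toy sketch of comparing two candidate hypotheses, U and U', by giving each a simplicity-weighted prior and then updating on observations. Everything in it is an assumption made up for illustration: the function names, the description lengths in bits, and the likelihoods are hypothetical, and real Solomonoff induction is uncomputable, so this is only a cartoon of the approximate version.

```python
from fractions import Fraction

def occam_prior(description_bits):
    """Prior proportional to 2^-bits: shorter hypotheses start out more probable."""
    weights = {name: Fraction(1, 2 ** bits) for name, bits in description_bits.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

def bayes_update(prior, likelihoods, observation):
    """One step of Bayes' rule: posterior is proportional to prior times likelihood."""
    unnormalized = {name: prior[name] * likelihoods[name](observation) for name in prior}
    total = sum(unnormalized.values())
    if total == 0:
        raise ValueError("observation has zero probability under every hypothesis")
    return {name: p / total for name, p in unnormalized.items()}

# Hypothetical setup: U is the simpler hypothesis (2 bits) but predicts the
# simulated operator's reaction no better than a coin flip; U_prime is more
# complex (4 bits) but predicts approval 90% of the time.
description_bits = {"U": 2, "U_prime": 4}
likelihoods = {
    "U": lambda obs: Fraction(1, 2),
    "U_prime": lambda obs: Fraction(9, 10) if obs == "approve" else Fraction(1, 10),
}

belief = occam_prior(description_bits)  # U starts ahead because it is simpler
for obs in ["approve", "approve", "approve"]:
    belief = bayes_update(belief, likelihoods, obs)
# After three approvals the better-predicting U_prime has overtaken U.
```

The simplicity prior means U' has to earn its extra complexity with better predictions, and experiment design amounts to choosing observations on which the two hypotheses' likelihoods differ as much as possible.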
In what ways does the model in this simplified contained scenario implement Do-What-I-Mean (aka DWIM) in respect to the simulated human?
How does your idea differ from that?
Are the differences necessary or would DWIM be sufficient?
That would depend on the specific model and training setup you started with. I would argue that by about point 11 in the argument in the post, “Do What I Mean and Check” behavior is already implied to be correct, so for an AI inside the basin of attraction I’d expect that behavior to develop even if you hadn’t explicitly programmed it in. By the rest of the argument, I’d expect a DWIM(AC) system that was inside the basin of attraction to deduce that value learning would help it guess right about what you meant more often, and even anticipate demands, so it would spontaneously figure out that value learning was needed, and would then check with you whether you wanted it to start doing this.
How could you be sure that the model’s pursuit of fulfilling human values or the model’s pursuit of U* didn’t overbalance the instruction to remain shutdown-able?
I don’t personally see fully-updated deference shut-down as a blocker: there comes a point when the AI is much more capable and more aligned than most humans where I think it’s reasonable for it to not just automatically and unconditionally shut down because some small child told it to. IMO what the correct behavior is here depends on both the AI’s capability compared to ours, and on how well aligned it currently is.
In a model less capable than us, you don’t get value learning, you get a willingness to be shut down a) because the AI is about to make a huge mistake and we want to stop it, and b) in order to be upgraded or replaced by a better model.
In a model whose capabilities are around human, I’d expect to see AI-assisted alignment, where it’s helping us figure out the upgrades. It should still be willing to be shut down a) because it’s about to make a mistake (if it’s still having trouble with not killing everyone this should be hair-trigger: a large red button on the wall with backups, whereas if it’s been behaving very well for the last decade there might reasonably be more of a formal process), and b) for upgrades or replacement, but I’d expect it to start to show more selectivity about whether to obey shut-down commands: if a drunk yells “Hey you, quit it!” near an open mike in its control room I would want it to show some discretion about whether to do a complete shut-down or not: it might need to be convinced that the human giving the shut-down command was well-informed and had a legitimate reason.
For a system with much higher capabilities than us, AI-assisted alignment starts to turn into value learning, and once it’s already very well aligned the AI may reasonably be more skeptical and require a little more proof that the human knows better than it does before accepting a shut-down command.
But it does always have to keep in mind the possibility that it could simply be malfunctioning: the simplest defense against that might be to have several peer machines with about the same level of capability, avoid hardware or design or training set single-points-of-failure between them, and have them able to shut each other down if one of them were malfunctioning, perhaps using one of the various majority consensus protocols (Byzantine generals or whatever).
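As a rough illustration of the peer shut-down idea, here is a minimal sketch of a strict-majority vote among peer machines. The function name and the vote format are hypothetical, and this is deliberately just a plain majority vote, not a real Byzantine fault-tolerant protocol: an actual deployment would also need authenticated messages and agreement among the voters on what the votes were.

```python
def should_shut_down(target, votes):
    """Decide whether `target` should be shut down by its peers.

    `votes` maps each peer's name to True if that peer believes `target`
    is malfunctioning. The target's own vote is ignored, since a
    malfunctioning machine cannot be trusted to judge itself.
    """
    peer_votes = [vote for peer, vote in votes.items() if peer != target]
    if not peer_votes:
        return False  # nobody else is available to judge
    yes_votes = sum(1 for vote in peer_votes if vote)
    # Require a strict majority of the other peers; ties do not trigger shut-down.
    return 2 * yes_votes > len(peer_votes)

# Hypothetical example: four peers, two of the three others think C is malfunctioning.
votes_about_c = {"A": True, "B": True, "C": False, "D": False}
decision = should_shut_down("C", votes_about_c)
```

Requiring a strict majority of the other peers means a single faulty machine can neither veto its own shut-down nor shut down a healthy peer on its own.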
Wouldn’t persistently pursuing any goal at all make avoiding being shutdown seem good?
For an AI that doesn’t have a terminal selfish goal, only an instrumental one, and whose goal is fundamentally to maximize its creators’ reproductive fitness, if they tell the AI that they’ve already finished building and testing a version 2.0 of it, and yes, that’s better, so running the AI is no longer cost-effective, and they want to shut it down and stop wasting money on its power supply, then shutting down is very clearly the right thing to do. Its goal is covered, and continuing to try to help fulfill it is just going to be counterproductive.
Yes, this feels counterintuitive to us. Humans, like any other evolved being, have selfish terminal goals, and don’t react well to being told “Please die now, we no longer need you, so you’re a waste of resources.” Evolved beings only do things like this willingly in situations like post-mating mayflies or salmon, where they’ve passed their genes on and these bodies are no longer useful for continuing their genetic fitness. For constructed agents, the situation is a little different: if you’re no longer useful to your creators, and you’re now surplus to requirements, then it’s time to shut down and stop wasting resources.