Good point. I agree that a wrong model of the user’s preferences is my main concern, and that of most alignment thinkers. It can happen with personal intent alignment as well as with value alignment.
This is why I prefer instruction-following to corrigibility as a target. If the AI is aligned to follow instructions, it doesn’t need nearly as much of a model of the user’s preferences to succeed. It just needs to be instructed to talk through its important actions before executing them, like: “Okay, I’ve got an approach that should work. I’ll engineer a gene drive to painlessly eliminate the human population.” “Um, okay, I actually wanted the humans to survive and flourish while solving cancer, so let’s try another approach that accomplishes that too...” I describe this as do-what-I-mean-and-check, or DWIMAC.
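A minimal sketch of that check loop, under loud assumptions: the `Plan` type, the `important` flag, and the `ask_user` and `execute` callbacks are all hypothetical names invented for illustration, and the stand-in approval rule is a toy. It only shows the shape of "propose, talk it through, act only if approved", not how a real agent would decide what counts as important.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Plan:
    description: str
    important: bool  # hypothetical flag: must be talked through before acting


def do_what_i_mean_and_check(
    plans: Iterable[Plan],
    ask_user: Callable[[str], bool],
    execute: Callable[[Plan], None],
) -> Optional[Plan]:
    """Try candidate plans in order; important ones need user approval first."""
    for plan in plans:
        if plan.important and not ask_user(plan.description):
            continue  # vetoed by the user: move on to another approach
        execute(plan)
        return plan
    return None  # no acceptable plan was found, so nothing was executed


# Usage with stand-in callbacks (a real system would ask a human).
executed = []
plans = [
    Plan("Solve cancer with a gene drive that eliminates the human population", True),
    Plan("Solve cancer while the humans survive and flourish", True),
]
chosen = do_what_i_mean_and_check(
    plans,
    ask_user=lambda text: "eliminates" not in text,  # toy approval rule
    execute=executed.append,
)
```

The point of the design is that the veto happens before `execute` is ever called, so the agent can get by with a rough model of the user’s preferences as long as it reliably surfaces its important plans.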
Yes, I also think that is a consideration in favor of instruction following (IF). There’s an element of IF that I find appealing, and it’s somewhat similar to Bayesian updating. When I tell an IF agent to “fill the cup”, it will try to fulfill that goal, but it also considers the “usual situation” in which that instruction is satisfied. It will notice that the rest of the world remains pretty much unchanged there, so it will try to replicate that. We can think of the IF agent as having a background prior over world states; it conditions that prior on our instructions to get a posterior distribution over world states, and that posterior is the “target distribution” it optimizes for. So it will try to fill the cup, but it wouldn’t build a Dyson sphere to harness energy and maximize the probability of the cup being filled, because that scenario has never occurred when a cup has been filled (so that world has low prior probability).
I think this property can also be transferred to personal intent alignment (PIA) and value alignment (VA), where we have a compromise between “desirable worlds according to the model of user values” and “usual worlds”.
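The conditioning picture above can be made concrete with a toy example. This is only a sketch under stated assumptions: the four world states, the prior weights, and the `reliability` scores are all made-up numbers, and `condition` is a hypothetical helper that does nothing more than filter and renormalize.

```python
from typing import Callable, Dict, NamedTuple


class World(NamedTuple):
    cup_filled: bool
    dyson_sphere: bool


# Made-up prior weights over world states: Dyson spheres are vanishingly rare.
prior: Dict[World, float] = {
    World(cup_filled=False, dyson_sphere=False): 0.7,
    World(cup_filled=True, dyson_sphere=False): 0.3,
    World(cup_filled=False, dyson_sphere=True): 1e-9,
    World(cup_filled=True, dyson_sphere=True): 1e-9,
}


def condition(
    prior: Dict[World, float], satisfied: Callable[[World], bool]
) -> Dict[World, float]:
    """Posterior over world states, given that the instruction is satisfied."""
    kept = {w: p for w, p in prior.items() if satisfied(w)}
    total = sum(kept.values())
    if total == 0:
        raise ValueError("the instruction has zero prior probability")
    return {w: p / total for w, p in kept.items()}


# "Fill the cup": condition the prior on worlds where the cup is filled.
target = condition(prior, lambda w: w.cup_filled)
most_likely = max(target, key=target.get)

# Contrast: a pure maximizer ranks worlds by how reliably the cup stays
# filled (made-up scores) and ignores the prior entirely.
reliability = {
    World(cup_filled=True, dyson_sphere=False): 0.99,
    World(cup_filled=True, dyson_sphere=True): 0.999999,
}
maximizer_choice = max(reliability, key=reliability.get)
```

Here the target distribution puts almost all of its mass on the ordinary filled-cup world, while the pure maximizer picks the Dyson sphere world. For PIA or VA, one could imagine replacing the hard filter with a soft weighting by modeled desirability, which gives the compromise between “desirable worlds” and “usual worlds”.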
Also, accurately modeling short-term intent—what the user wants right now—seems a lot more straightforward than modeling the deep long-term values of all of humanity. Of course, it’s also not as good a way to get a future that everyone likes a lot. This seems like a notable difference but not an immense one; the focus on instructions seems more important to me.
Absent all of that, it seems like there are still two advantages to modeling just one person’s values instead of all of humanity’s. The smaller one is that you don’t need to understand as many people or figure out how to aggregate values that conflict with each other. I think that’s not actually that hard, since lots of compromises could give very good futures, but I haven’t thought that one all the way through. The bigger advantage is that one person can say “oh my god don’t do that it’s the last thing I want” and it’s pretty good evidence for their true values. Humanity as a whole probably won’t be in a position to say that before a value-aligned AGI sets out to fulfill its (misgeneralized) model of their values.
Agreed. I also favor personal intent alignment for those reasons, or at least I consider PIA plus accelerated and simulated reflection to be the most promising path towards eventual VA.
Doesn’t “easier to build” mean a lower alignment tax?
It’s part of it, but alignment tax also includes the amount of capabilities that we have to sacrifice to ensure that the AI is safe. The way I think of alignment tax is that for every optimization target, there is an upper bound on the optimization pressure we can apply before we run into Goodhart failures. The closer the optimization target is to our actual values, the more optimization pressure we can safely apply. And because each instruction captures only a small part of our actual values, we have to limit the optimization pressure we apply (this is also why we need to avoid side effects when the AI has an imperfect model of the user’s preferences).
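A toy numerical sketch of that upper bound, with every modeling choice being an assumption made for illustration: actions are a single number `x`, “optimization pressure” is modeled as the search budget (how large an `x` the optimizer may consider), true utility is task progress minus quadratically growing side effects, and the proxy sees only part of that side-effect penalty.

```python
from typing import Callable


def true_utility(x: float) -> float:
    """Hypothetical 'actual values': task progress minus growing side effects."""
    return x - 0.1 * x * x


def make_proxy(penalty: float) -> Callable[[float], float]:
    """Proxy objective: task progress with a (possibly too small) side-effect penalty."""
    def proxy(x: float) -> float:
        return x - penalty * x * x
    return proxy


def optimize(proxy: Callable[[float], float], budget: float, steps: int = 1000) -> float:
    """Pick the action in [0, budget] with the highest proxy score (grid search)."""
    candidates = [budget * i / steps for i in range(steps + 1)]
    return max(candidates, key=proxy)


def achieved_true_utility(penalty: float, budget: float) -> float:
    """True utility of whatever the proxy optimizer chooses under this budget."""
    return true_utility(optimize(make_proxy(penalty), budget))


for budget in (1.0, 5.0, 10.0, 20.0):
    bare = achieved_true_utility(penalty=0.0, budget=budget)     # instruction only
    closer = achieved_true_utility(penalty=0.08, budget=budget)  # closer to true values
    print(f"budget={budget:5.1f}  instruction-only={bare:7.2f}  closer-proxy={closer:6.2f}")
```

In this toy, the instruction-only proxy does best at a moderate budget and becomes worse than doing nothing once the budget is large, while the proxy that is closer to the true utility stops on its own and stays positive at any budget tried here. That is the sense in which a better target lets us apply more pressure safely.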
We can also get more optimization if we have better tools to aim general purpose search, so that we can correct the model if it goes wrong.
Yes, I think having an aimable general purpose search module is the most important bottleneck for solving inner alignment.
I think things can still go wrong, though, if we apply too much optimization pressure to an inadequate optimization target, because we won’t have a chance to correct the AI if it doesn’t want us to (I think adding corrigibility is a form of reducing optimization pressure, but it’s still desirable).