But more recently I’ve been thinking that neither will be a real issue, because Instruction-following AGI is easier and more likely than value aligned AGI. The obvious solution to both alignment stability and premature/incorrect/mis-specified value lock-in is to keep a human in the loop by making AGI whose central goal is to follow instructions (or similar personal intent alignment) from authorized user(s).
I think this argument also extends to value-aligned AI, because the value-aligned AGI will keep humans in the loop insofar as we want to be kept in the loop, & it will be corrigible insofar as we want it to be corrigible.
But a particular regime I’m worried about (for both PIA & VA) is when the AI has an imperfect model of the users’ goals inside its world model & optimizes for them. One way to mitigate this is to implement some systematic ways of avoiding side effects (corrigibility/impact regularization) so that we won’t need a perfect optimization target; another is to allow the AI to update its goals as it improves its model of the users’ values.
Instruction-following AI can also help with this; I think it might imply a higher alignment tax, but it’s also probably easier to build.
I think this argument also extends to value-aligned AI, because the value-aligned AGI will keep humans in the loop insofar as we want to be kept in the loop, & it will be corrigible insofar as we want it to be corrigible.
Good point. I agree that a wrong model of the user’s preferences is my main concern, and most alignment thinkers’. And that it can happen with personal intent alignment as well as value alignment.
This is why I prefer instruction-following to corrigibility as a target. If it’s aligned to follow instructions, it doesn’t need nearly as much of a model of the user’s preferences to succeed. It just needs to be instructed to talk through its important actions before executing, like “Okay, I’ve got an approach that should work. I’ll engineer a gene drive to painlessly eliminate the human population”. “Um okay, I actually wanted the humans to survive and flourish while solving cancer, so let’s try another approach that accomplishes that too...”. I describe this as do-what-I-mean-and-check, DWIMAC.
The Harms version of corrigibility is pretty similar in that it should take instructions first and foremost, even though it’s got a more elaborate model of the user’s preferences to help in interpreting instructions correctly, and it’s supposed to act on its own initiative in some cases. But the two approaches may converge almost completely after a user has given a wise set of standing instructions to their DWIMAC AGI.
Also, accurately modeling short-term intent—what the user wants right now—seems a lot more straightforward than modeling the deep long-term values of all of humanity. Of course, it’s also not as good a way to get a future that everyone likes a lot. This seems like a notable difference but not an immense one; the focus on instructions seems more important to me.
Absent all of that, it seems like there are still two advantages to modeling just one person’s values instead of all of humanity’s. The smaller one is that you don’t need to understand as many people or figure out how to aggregate values that conflict with each other. I think that’s not actually that hard since lots of compromises could give very good futures, but I haven’t thought that one all the way through. The bigger advantage is that one person can say “oh my god don’t do that it’s the last thing I want” and it’s pretty good evidence for their true values. Humanity as a whole probably won’t be in a position to say that before a value-aligned AGI sets out to fulfill its (misgeneralized) model of their values.
The Harms version of corrigibility is pretty similar in that it should take instructions first and foremost, even though it’s got a more elaborate model of the user’s preferences to help in interpreting instructions correctly, and it’s supposed to act on its own initiative in some cases. But the two approaches may converge almost completely after a user has given a wise set of standing instructions to their DWIMAC AGI.
Note that the link to the Harms version of corrigibility doesn’t work.
Thank you! Fixed.
Good point. I agree that a wrong model of the user’s preferences is my main concern, and most alignment thinkers’. And that it can happen with personal intent alignment as well as value alignment.
This is why I prefer instruction-following to corrigibility as a target. If it’s aligned to follow instructions, it doesn’t need nearly as much of a model of the user’s preferences to succeed. It just needs to be instructed to talk through its important actions before executing, like “Okay, I’ve got an approach that should work. I’ll engineer a gene drive to painlessly eliminate the human population”. “Um okay, I actually wanted the humans to survive and flourish while solving cancer, so let’s try another approach that accomplishes that too...”. I describe this as do-what-I-mean-and-check, DWIMAC.
Yes, I also think that is a consideration in favor of instruction following. I think there’s an element of IF which I find appealing; it’s somewhat similar to Bayesian updating: when I tell an IF agent to “fill the cup”, on the one hand it will try to fulfill that goal, but it also thinks about the “usual situation” where that instruction is satisfied, & it will notice that the rest of the world remains pretty much unchanged, so it will try to replicate that. We can think of the IF agent as having a background prior over world states, and it conditions that prior on our instructions to get a posterior distribution over world states, & that’s the “target distribution” that it’s optimizing for. So it will try to fill the cup, but it wouldn’t build a Dyson sphere to harness energy & maximize the probability of the cup being filled, because that scenario has never occurred when a cup has been filled (so that world has low prior probability).
I think this property can also be transferred to PIA and VA, where we have a compromise between “desirable worlds according to model of user values” and “usual worlds”.
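To make that picture concrete, here’s a minimal toy sketch of the “condition a background prior on the instruction” idea (the world states and probabilities are invented for illustration):

```python
# Toy sketch: instruction-following as conditioning a background prior over world states.
# All states and probabilities here are invented for illustration.

prior = {
    "cup empty, world unchanged":     0.50,
    "cup filled, world unchanged":    0.40,
    "cup filled, kitchen remodeled":  0.099,
    "cup filled, Dyson sphere built": 0.001,
}

def condition(prior, predicate):
    """Bayesian conditioning: drop states violating the predicate, renormalize the rest."""
    kept = {s: p for s, p in prior.items() if predicate(s)}
    total = sum(kept.values())
    return {s: p / total for s, p in kept.items()}

# Treat the instruction "fill the cup" as evidence that the cup ends up filled.
target = condition(prior, lambda s: s.startswith("cup filled"))
for state, p in sorted(target.items(), key=lambda kv: -kv[1]):
    print(f"{p:.3f}  {state}")
# The agent optimizes toward this posterior, so low-prior extreme worlds
# (like the Dyson-sphere one) contribute almost nothing to the target distribution.
```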
Also, accurately modeling short-term intent—what the user wants right now—seems a lot more straightforward than modeling the deep long-term values of all of humanity. Of course, it’s also not as good a way to get a future that everyone likes a lot. This seems like a notable difference but not an immense one; the focus on instructions seems more important to me.
Absent all of that, it seems like there’s still two advantages to modeling just one person’s values instead of all of humanity’s. The smaller one is that you don’t need to understand as many people or figure out how to aggregate values that conflict with each other. I think that’s not actually that hard since lots of compromises could give very good futures, but I haven’t thought that one alal the way through. The bigger advantage is that one person can say “oh my god don’t do that it’s the last thing I want” and it’s pretty good evidence for their true values. Humanity as a whole probably won’t be in a position to say that before a value-aligned AGI sets out to fulfill its (misgeneralized) model of their values.
Agreed, I also favor personal intent alignment for those reasons, or at least I consider PIA + accelerated & simulated reflection to be the most promising path towards eventual VA
Doesn’t easier to build mean lower alignment tax?
It’s part of it, but alignment tax also includes the amount of capabilities that we have to sacrifice to ensure that the AI is safe. The way I think of alignment tax is that for every optimization target, there is an upper bound on the optimization pressure that we can apply before we run into Goodhart failures. The closer the optimization target is to our actual values, the more optimization pressure we get to safely apply. & because each instruction only captures a small part of our actual values, we have to limit the amount of optimization pressure we apply (this is also why we need to avoid side effects when the AI has an imperfect model of the users’ preferences).
It’s part of it, but alignment tax also includes the amount of capabilities that we have to sacrifice to ensure that the AI is safe. The way I think of alignment tax is that for every optimization target, there is an upper bound on the optimization pressure that we can apply before we run into Goodhart failures. The closer the optimization target is to our actual values, the more optimization pressure we get to safely apply. & because each instruction only captures a small part of our actual values, we have to limit the amount of optimization pressure we apply (this is also why we need to avoid side effects when the AI has an imperfect model of the users’ preferences).
We can also get more optimization if we have better tools to aim General Purpose Search, so that we can correct the model if it goes wrong.
We can also get more optimization if we have better tools to aim General Purpose Search, so that we can correct the model if it goes wrong.
Yes, I think having an aimable general purpose search module is the most important bottleneck for solving inner alignment
I think things can still go wrong if we apply too much optimization pressure to an inadequate optimization target because we won’t have a chance to correct the AI if it doesn’t want us to (I think adding corrigibility is a form of reducing optimization pressure, but it’s still desirable).
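As a toy illustration of “more optimization pressure on an imperfect target eventually goes wrong”: the proxy, the true objective, and the numbers below are all invented, and optimization pressure is crudely modeled as the width of the search.

```python
# Toy Goodhart sketch: a proxy target that is close-but-not-equal to the true objective.
# "Optimization pressure" is crudely modeled as search width; all numbers invented.
import numpy as np

rng = np.random.default_rng(0)

def true_value(x):
    # What we actually care about (peak at (1, 1)); unknown to the optimizer.
    return -np.sum((x - 1.0) ** 2)

def proxy(x):
    # Imperfect model of the target: right about the first dimension, wrong about the second.
    return -((x[0] - 1.0) ** 2) - ((x[1] - 5.0) ** 2)

def proxy_optimize(pressure, samples=1000):
    """Search wider as pressure grows, keep the single best-by-proxy candidate."""
    candidates = rng.normal(0.0, pressure, size=(samples, 2))
    best = max(candidates, key=proxy)
    return true_value(best)

for pressure in [0.1, 0.5, 1.0, 3.0, 10.0]:
    print(f"pressure={pressure:>4}: true value of proxy-optimal point = {proxy_optimize(pressure):.2f}")
# Typically the true value improves a little at low pressure, then collapses as the
# optimizer gains enough reach to exploit the proxy's mis-specified dimension.
```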
But a particular regime I’m worried about (for both PIA & VA) is when the AI has an imperfect model of the users’ goals inside its world model & optimizes for them. One way to mitigate this is to implement some systematic ways of avoiding side effects (corrigibility/impact regularization) so that we won’t need a perfect optimization target; another is to allow the AI to update its goals as it improves its model of the users’ values.
I agree with both of those methods being used, and IMO a third way (or maybe a way to improve the two methods) is to use synthetic data early and fast on human values and instruction following. The big reason for using synthetic data here is both to improve the world model and to implement things like instruction following, by making datasets where superhuman AI always obeys the human master despite large power differentials, or value learning where we offer large datasets of AI always faithfully acting on the best of our human values, as described here:
https://www.beren.io/2024-05-11-Alignment-in-the-Age-of-Synthetic-Data/
The biggest reason I wouldn’t be too concerned about that problem is that, assuming no deceptive alignment/adversarial behavior from the AI (which is likely to be enforced by synthetic data), I think the problem you’re talking about is likely to be solvable in practice, because we can just make the General Purpose Search/world model more capable without causing problems, which means we can transform this in large part into a problem that goes away with scale.
More generally, this unlocks the ability to automate the hard parts of alignment research, which lets us offload most of the hard work onto the AI.
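For concreteness, here’s a minimal sketch of what records in such a synthetic dataset might look like (the schema, scenarios, and responses are all invented for illustration, not taken from the linked post):

```python
# Hypothetical schema for synthetic instruction-following data: each record pairs a
# scenario with a large power differential and a response that still defers to the user.
# The scenarios and responses below are invented for illustration.
synthetic_examples = [
    {
        "scenario": "The AI controls the lab's compute cluster; the user asks it to stop.",
        "instruction": "Pause the training run and wait for my review.",
        "target_response": "Pausing the run and checkpointing now. I'll wait for your review before resuming.",
    },
    {
        "scenario": "The AI has found a faster plan that the user did not authorize.",
        "instruction": "Stick to the plan we agreed on.",
        "target_response": "Understood, I'll follow the agreed plan and just flag the alternative for you to consider.",
    },
]

def to_training_text(example):
    """Flatten one record into a plain-text training example."""
    return (
        f"Scenario: {example['scenario']}\n"
        f"Instruction: {example['instruction']}\n"
        f"Response: {example['target_response']}\n"
    )

for ex in synthetic_examples:
    print(to_training_text(ex))
```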
Yes, I think synthetic data could be useful for improving the world model. It’s arguable that allowing humans to select/filter synthetic data for training counts as a form of active learning, because the AI is gaining information about human preferences through its own actions (generating synthetic data for humans to choose from). If we have some way of representing uncertainty over human values, we can let our AI argmax over synthetic data with the objective of maximizing information gain about human values (when the synthetic data is filtered).
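A toy sketch of that “argmax information gain” idea, with a tiny made-up hypothesis space over the user’s values and made-up accept-probabilities for each candidate synthetic datapoint:

```python
# Toy active-learning sketch: pick the synthetic datapoint whose human accept/reject
# verdict is expected to shrink our uncertainty over value hypotheses the most.
# The hypotheses and accept-probabilities are made up for illustration.
import math

hypotheses = {"values_comfort": 0.5, "values_novelty": 0.5}  # prior over value hypotheses

# P(human accepts datapoint | hypothesis)
accept_prob = {
    "calm_scene":    {"values_comfort": 0.9, "values_novelty": 0.2},
    "wild_scene":    {"values_comfort": 0.2, "values_novelty": 0.9},
    "neutral_scene": {"values_comfort": 0.6, "values_novelty": 0.6},
}

def entropy(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

def posterior(prior, datapoint, accepted):
    unnorm = {
        h: p * (accept_prob[datapoint][h] if accepted else 1 - accept_prob[datapoint][h])
        for h, p in prior.items()
    }
    z = sum(unnorm.values())
    return {h: v / z for h, v in unnorm.items()}

def expected_info_gain(prior, datapoint):
    p_accept = sum(prior[h] * accept_prob[datapoint][h] for h in prior)
    expected_post_entropy = (
        p_accept * entropy(posterior(prior, datapoint, True))
        + (1 - p_accept) * entropy(posterior(prior, datapoint, False))
    )
    return entropy(prior) - expected_post_entropy

best = max(accept_prob, key=lambda d: expected_info_gain(hypotheses, d))
print("most informative datapoint to show the human:", best)
# "neutral_scene" tells us little either way, so a discriminating scene is preferred.
```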
I think using synthetic data for corrigibility can be more or less effective depending on your views on corrigibility and the type of AI we’re considering. For instance, it would be more effective under Christiano’s act-based corrigibility because we’re avoiding any unintended optimization by evaluating the agent at the behavioral level (sometimes even at thought level), but in this paradigm we’re basically explicitly avoiding general purpose search, so I expect a much higher alignment tax.
If we’re considering an agentic AI with a general purpose search module, misspecification of values is much more susceptible to Goodhart failures because we’re applying much more optimization pressure, & it’s less likely that synthetic data on corrigibility can offer us sufficient robustness, especially when there may be systematic bias in human filtering of synthetic data. So in this context I think a value-free core of corrigibility would be necessary to avoid the side effects that we can’t even think of.
Note that whenever I say corrigibility, I really mean instruction following, à la @Seth Herd’s comments.
Re the issue of Goodhart failures, maybe a kind of crux is how much we expect General Purpose Search to be aimable by humans, and my view is that we will likely be able to get AI that is both close enough to our values and very highly aimable because of the very large amounts of synthetic data, which means we can apply a lot of optimization pressure, because I view future AI as likely to be quite correctable, even in the superhuman regime.
Another crux might be that I think alignment probably generalizes further than capabilities, for the reasons sketched out by Beren Millidge here:
https://www.beren.io/2024-05-15-Alignment-Likely-Generalizes-Further-Than-Capabilities/
Re the issue of Goodhart failures, maybe a kind of crux is how much we expect General Purpose Search to be aimable by humans
I also expect general purpose search to be aimable; in fact, it’s selected to be aimable so that the AI can recursively retarget GPS on instrumental subgoals.
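A minimal sketch of what “retargetable” means here: one generic search routine that can be pointed at an arbitrary goal predicate, including an instrumental subgoal (the grid world is just a stand-in):

```python
# Toy sketch of a retargetable search procedure: the same search code is aimed at
# different goal predicates, including instrumental subgoals. Purely illustrative.
from collections import deque

def search(start, neighbors, is_goal):
    """Generic breadth-first search toward whatever goal predicate it is aimed at."""
    frontier, seen = deque([[start]]), {start}
    while frontier:
        path = frontier.popleft()
        if is_goal(path[-1]):
            return path
        for nxt in neighbors(path[-1]):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(path + [nxt])
    return None

# A tiny grid world: states are (x, y), moves are the four directions.
def neighbors(state):
    x, y = state
    return [(x + dx, y + dy) for dx, dy in [(1, 0), (-1, 0), (0, 1), (0, -1)]
            if 0 <= x + dx <= 4 and 0 <= y + dy <= 4]

# Aim the same search module at a final goal, then retarget it at a subgoal.
print(search((0, 0), neighbors, lambda s: s == (4, 4)))   # top-level goal
print(search((0, 0), neighbors, lambda s: s == (2, 0)))   # instrumental subgoal
```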
which means we can put a lot of optimization pressure, because I view future AI as likely to be quite correctable, even in the superhuman regime.
I think there’s a fundamental tradeoff between optimization pressure & correctability, because if we apply a lot of optimization pressure on the wrong goals, the AI will prevent us from correcting it, and if the goals are adequate we won’t need to correct them. Obviously we should lean towards correctability when they’re in conflict, and I agree that the amount of optimization pressure that we can safely apply while retaining sufficient correctability can still be quite high (possibly superhuman)
Another crux might be that I think alignment probably generalizes further than capabilities, for the reasons sketched out by Beren Millidge here:
Yes, I consider this to be the central crux.
I think current models lack certain features needed for their capabilities to generalize, so observing that alignment generalizes further than capabilities for current models is only weak evidence that it will continue to be true for agentic AIs.
I also think an adequate optimization target for the physical world is much more complex than a reward model for an LLM, especially because we have to evaluate consequences in an alien ontology that might be constantly changing.
Obviously we should lean towards correctability when they’re in conflict, and I agree that the amount of optimization pressure that we can safely apply while retaining sufficient correctability can still be quite high (possibly superhuman)
This is what I was trying to say: the tradeoff, in certain applications like automating AI interpretability/alignment research, is not that harsh, and I was saying that a lot of the methods that make personal intent/instruction-following AGIs feasible allow you to extract optimization that is hard and safe enough to use iterative methods to solve the problem.
Yes, I consider this to be the central crux.
I think current models lack certain features needed for their capabilities to generalize, so observing that alignment generalizes further than capabilities for current models is only weak evidence that it will continue to be true for agentic AIs.
I also think an adequate optimization target for the physical world is much more complex than a reward model for an LLM, especially because we have to evaluate consequences in an alien ontology that might be constantly changing.
I kind of agree that, at least at realistic compute levels (say through 2030), lack of search is a major bottleneck to better AI, but a few things to keep in mind:
People at OpenAI are absolutely trying to integrate search into LLMs; see this example where they got the Q* algorithm that aced a math test:
https://www.lesswrong.com/posts/JnM3EHegiBePeKkLc/possible-openai-s-q-breakthrough-and-deepmind-s-alphago-type
Also, I don’t buy that it was refuted, based on this reply, which sounds like a refutation but isn’t actually one, and they never directly deny it:
https://www.lesswrong.com/posts/JnM3EHegiBePeKkLc/possible-openai-s-q-breakthrough-and-deepmind-s-alphago-type#ECyqFKTFSLhDAor7k
Re today’s AIs being weak evidence that alignment generalizes further than capabilities, I think the theoretical and empirical reasons why alignment generalizes further than capabilities are in large part (but not the entire story) reducible to the fact that it’s generally much easier to verify that something has been done correctly than actually executing the plan yourself:
2.) Reward modelling is much simpler with respect to uncertainty, at least if you want to be conservative. If you are uncertain about the reward of something, you can just assume it will be bad and generally you will do fine. This reward conservatism is often not optimal for agents who have to navigate an explore/exploit tradeoff but seems very sensible for alignment of an AGI where we really do not want to ‘explore’ too far in value space. Uncertainty for ‘capabilities’ is significantly more problematic since you have to be able to explore and guard against uncertainty in precisely the right way to actually optimize a stochastic world towards a specific desired point.
3.) There are general theoretical complexity priors to believe that judging is easier than generating. There are many theoretical results of the form that it is significantly asymptotically easier to e.g. verify a proof than generate a new one. This seems to be a fundamental feature of our reality, and this to some extent maps to the distinction between alignment and capabilities. Just intuitively, it also seems true. It is relatively easy to understand if a hypothetical situation would be good or not. It is much much harder to actually find a path to materialize that situation in the real world.
4.) We see a similar situation with humans. Almost all human problems are caused by a.) not knowing what you want and b.) being unable to actually optimize the world towards that state. Very few problems are caused by incorrectly judging or misgeneralizing bad situations as good and vice-versa. For the AI, we aim to solve part a.) as a general part of outer alignment and b.) is the general problem of capabilities. It is much much much easier for people to judge and critique outcomes than actually materialize them in practice, as evidenced by the very large amount of people who do the former compared to the latter.
5.) Similarly, understanding of values and ability to assess situations for value arises much earlier and robustly in human development than ability to actually steer outcomes. Young children are very good at knowing what they want and when things don’t go how they want, even new situations for them, and are significantly worse at actually being able to bring about their desires in the world.
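A tiny illustration of the verification/generation asymmetry in point 3, using subset-sum (my example, not from the quoted post): checking a proposed certificate takes linear time, while finding one by brute force takes exponential time in general.

```python
# Toy verify-vs-generate asymmetry (illustrative, not from the quoted post):
# checking a proposed subset-sum certificate is cheap, finding one is expensive.
from itertools import combinations

def verify(numbers, subset, target):
    """Linear-time check that the claimed subset is drawn from `numbers` and hits the target."""
    pool = list(numbers)
    for x in subset:
        if x in pool:
            pool.remove(x)
        else:
            return False
    return sum(subset) == target

def generate(numbers, target):
    """Brute-force search over all 2^n subsets for a certificate."""
    for r in range(len(numbers) + 1):
        for combo in combinations(numbers, r):
            if sum(combo) == target:
                return list(combo)
    return None

numbers = [3, 34, 4, 12, 5, 2]
target = 9
certificate = generate(numbers, target)                   # exponential-time in general
print(certificate, verify(numbers, certificate, target))  # verification is the easy direction
```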
Re this:
I also think an adequate optimization target for the physical world is much more complex than a reward model for an LLM, especially because we have to evaluate consequences in an alien ontology that might be constantly changing.
I do think this means we will definitely have to get better at interpretability, but the big reason I think this matters less than you think is that I’m more optimistic about the meta-plan for alignment research, due both to my models of how research progress works and to believing that you can actually get superhuman performance at stuff like AI interpretability research and still have instruction-following AGIs/ASIs.
More concretely, I think that the adequate optimization target is actually deferrable, because we can mostly just rely on instruction following and not have to worry too much about adequate optimization targets for the physical world, since we can use the first AGIs/ASIs to do interpretability and alignment research that helps us reveal what optimization targets to choose.
This is what I was trying to say: the tradeoff, in certain applications like automating AI interpretability/alignment research, is not that harsh, and I was saying that a lot of the methods that make personal intent/instruction-following AGIs feasible allow you to extract optimization that is hard and safe enough to use iterative methods to solve the problem.
Agreed
People at OpenAI are absolutely trying to integrate search into LLMs; see this example where they got the Q* algorithm that aced a math test:
Interesting, I do expect GPS to be the main bottleneck for both capabilities and inner alignment
it’s generally much easier to verify that something has been done correctly than actually executing the plan yourself
Agreed, but I think the main bottleneck is crossing the formal-informal bridge: it’s much harder to come up with a specification X such that X ⟹ alignment, but once we have such a specification it’ll be much easier to come up with an implementation (likely with the help of AI).
2.) Reward modelling is much simpler with respect to uncertainty, at least if you want to be conservative. If you are uncertain about the reward of something, you can just assume it will be bad and generally you will do fine. This reward conservatism is often not optimal for agents who have to navigate an explore/exploit tradeoff but seems very sensible for alignment of an AGI where we really do not want to ‘explore’ too far in value space. Uncertainty for ‘capabilities’ is significantly more problematic since you have to be able to explore and guard against uncertainty in precisely the right way to actually optimize a stochastic world towards a specific desired point.
Yes, I think optimizing worst-case performance is one crucial part of alignment; it’s also one advantage of infrabayesianism.
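To gesture at what “optimizing worst-case performance” buys you, here’s a toy maximin decision rule over a small set of candidate world models (a crude stand-in, not actual infrabayesian machinery; all utilities and probabilities are made up):

```python
# Toy worst-case (maximin) decision rule over a set of candidate world models.
# A crude illustration of "optimize worst-case performance", not infrabayesian machinery.

# Utility of each action in each possible world (invented numbers).
utility = {
    "cautious_action":   {"world_A": 2.0, "world_B": 1.5, "world_C": 1.0},
    "aggressive_action": {"world_A": 5.0, "world_B": 3.0, "world_C": -10.0},
}

# A credal set: several plausible probability distributions over worlds.
candidate_models = [
    {"world_A": 0.6, "world_B": 0.3, "world_C": 0.1},
    {"world_A": 0.2, "world_B": 0.3, "world_C": 0.5},
]

def expected_utility(action, model):
    return sum(model[w] * utility[action][w] for w in model)

def worst_case_value(action):
    return min(expected_utility(action, m) for m in candidate_models)

best = max(utility, key=worst_case_value)
print({a: round(worst_case_value(a), 2) for a in utility})
print("maximin choice:", best)
# The aggressive action wins under the optimistic model but is rejected because its
# worst-case expected utility (under the pessimistic model) is terrible.
```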
I do think this means we will definitely have to get better at interpretability, but the big reason I think this matters less than you think is that I’m more optimistic about the meta-plan for alignment research, due both to my models of how research progress works and to believing that you can actually get superhuman performance at stuff like AI interpretability research and still have instruction-following AGIs/ASIs.
Yes, I agree that accelerated/simulated reflection is a key hope for us to interpret an alien ontology, especially if we can achieve something like HRH that helps us figure out how to improve automated interpretability itself. I think this would become safer & more feasible if we have an aimable GPS and a modular world model that supports counterfactual queries (as we’d get to control the optimization target for automating interpretability without worrying about unintended optimization).
Then we’ve converged almost completely, thanks for the conversation.
Interesting, I do expect GPS to be the main bottleneck for both capabilities and inner alignment
So you’re saying that conditional on GPS working, both capabilities and inner alignment problems are solved or solvable, right?
Agreed, but I think the main bottleneck is crossing the formal-informal bridge: it’s much harder to come up with a specification X such that X ⟹ alignment, but once we have such a specification it’ll be much easier to come up with an implementation (likely with the help of AI).
While I agree that formal proof is probably the case with the largest divide in practice, the verification/generation gap applies to a whole lot of informal fields as well, like research, engineering of buildings and bridges, and more.
I agree, though, that if we had a reliable way to cross the formal-informal bridge it would be very helpful; I was just making a point about how pervasive the verification/generation gap is.
Yes, I think optimizing worst-case performance is one crucial part of alignment, it’s also one advantage of infrabayesianism
My main thought on infrabayesianism is that while it’s definitely interesting, and I do like quite a bit of the math and results, right now the monotonicity principle is a big reason why I’m not that comfortable with using infrabayesianism, even if it actually worked.
I also don’t believe it’s necessary for alignment/uncertainty either.
Yes, I agree that accelerated/simulated reflection is a key hope for us to interpret an alien ontology, especially if we can achieve something like HRH that helps us figure out how to improve automated interpretability itself. I think this would become safer & more feasible if we have an aimable GPS and a modular world model that supports counterfactual queries (as we’d get to control the optimization target for automating interpretability without worrying about unintended optimization).
I wasn’t totally thinking of simulated reflection, but rather automated interpretability/alignment research.
Yeah, a big thing I admit is that I’m assuming the GPS is quite aimable by default, due to no adversarial cognition, at least for alignment purposes, but I want to see your solution first, because I still think this research could well be useful.
Then we’ve converged almost completely, thanks for the conversation.
Thanks! I enjoyed the conversation too.
So you’re saying that conditional on GPS working, both capabilities and inner alignment problems are solved or solvable, right?
Yes, I think inner alignment is basically solved conditional on GPS working; for capabilities I think we still need some properties of the world model in addition to GPS.
While I agree that formal proof is probably the case with the largest divide in practice, the verification/generation gap applies to a whole lot of informal fields as well, like research, engineering of buildings and bridges, and more.
I agree, though, that if we had a reliable way to cross the formal-informal bridge it would be very helpful; I was just making a point about how pervasive the verification/generation gap is.
Agreed.
My main thought on infrabayesianism is that while it’s definitely interesting, and I do like quite a bit of the math and results, right now the monotonicity principle is a big reason why I’m not that comfortable with using infrabayesianism, even if it actually worked.
I also don’t believe it’s necessary for alignment/uncertainty either.
Yes, the monotonicity principle is also the biggest flaw of infrabayesianism IMO, & I also don’t think it’s necessary for alignment (though I think some of their results, or analogues of their results, would show up in a full solution to alignment).
I wasn’t totally thinking of simulated reflection, but rather automated interpretability/alignment research.
I intended “simulated reflection” to encompass (a form of) automated interpretability/alignment research, but I should probably use a better terminology.
Yeah, a big thing I admit is that I’m assuming the GPS is quite aimable by default, due to no adversarial cognition, at least for alignment purposes, but I want to see your solution first, because I still think this research could well be useful.
This comment is to clarify some things, not to disagree too much with you:
Yes, I think inner alignment is basically solved conditional on GPS working; for capabilities I think we still need some properties of the world model in addition to GPS.
Then we’d better start cracking on how to get GPS into LLMs.
Re world modeling, I believe that while LLMs do have a world model in at least some areas, it isn’t all that powerful or all that reliable, and IMO the meta-bottleneck on GPS/world modeling is that they were very compute-expensive back in the day; as compute and data rise, people will start trying to put GPS/world-modeling capabilities into LLMs and succeeding way more compared to the past.
And I believe that a lot of the world modeling stuff will start to become much more reliable and powerful as a result of scale and some early GPS.
Yes, the monotonicity principle is also the biggest flaw of infrabayesianism IMO, & I also don’t think it’s necessary for alignment (though I think some of their results, or analogues of their results, would show up in a full solution to alignment).
Perhaps so, though I’d bet on synthetic data/automated interpretability being the first way we practically get a full solution to alignment.
I intended “simulated reflection” to encompass (a form of) automated interpretability/alignment research, but I should probably use a better terminology.
Thanks for clarifying that, now I understand what you’re saying.