But a particular regime I’m worried about (for both PIA & VA) is when the AI has an imperfect model of the users’ goals inside its world model & optimizes for them. One way to mitigate this is to implement some systematic way of avoiding side effects (corrigibility/impact regularization), so that we won’t require a perfect optimization target; another is to allow the AI to update its goals as it improves its model of the users’ values.
I agree with using both of those methods, and IMO a third way, or maybe a way to improve the two methods, is to use synthetic data early and fast on human values and instruction following. The big reason for using synthetic data here is twofold: it improves the world model, and it lets us implement stuff like instruction following, by making datasets where superhuman AI always obeys the human master despite large power differentials, or value learning, where we offer large datasets of AI always faithfully acting on the best of our human values, as described here:
https://www.beren.io/2024-05-11-Alignment-in-the-Age-of-Synthetic-Data/
The biggest reason I wouldn’t be too concerned about that problem is that, assuming no deceptive alignment/adversarial behavior from the AI (which is likely to be enforced by synthetic data), the problem you’re talking about is likely to be solvable in practice, because we can just make the AI’s General Purpose Search/world model more capable without causing problems. That means we can transform this in large part into a problem that goes away with scale.
More generally, this unlocks the ability to automate the hard parts of alignment research, which lets us offload most of the hard work onto the AI.
Yes, I think synthetic data could be useful for improving the world model. It’s arguable that allowing humans to select/filter synthetic data for training counts as a form of active learning, because the AI is gaining information about human preferences through its own actions (generating synthetic data for humans to choose). If we have some way of representing uncertainty over human values, we can let our AI argmax over synthetic data with the objective of maximizing information gain about human values (when the synthetic data is filtered).
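A minimal sketch of that information-gain objective, assuming a toy discrete posterior over value hypotheses; the hypothesis names, candidate datapoints, and probabilities below are all made up purely to illustrate the selection rule:

```python
import math

def entropy(p):
    """Shannon entropy (bits) of a discrete distribution."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def posterior(prior, likelihoods):
    """Bayes update over value hypotheses given the likelihood of the observed label."""
    joint = [pr * lk for pr, lk in zip(prior, likelihoods)]
    z = sum(joint)
    return [j / z for j in joint]

def expected_info_gain(prior, p_accept_given_h):
    """Expected entropy reduction from observing the human's accept/reject decision
    on one candidate synthetic datapoint; p_accept_given_h[i] is the probability
    the human accepts the datapoint if value hypothesis i is true."""
    p_accept = sum(pr * pa for pr, pa in zip(prior, p_accept_given_h))
    p_reject_given_h = [1 - pa for pa in p_accept_given_h]
    expected_posterior_entropy = (
        p_accept * entropy(posterior(prior, p_accept_given_h))
        + (1 - p_accept) * entropy(posterior(prior, p_reject_given_h))
    )
    return entropy(prior) - expected_posterior_entropy

# Two hypothetical value hypotheses, uniform prior, two candidate datapoints.
prior = [0.5, 0.5]
candidates = {
    "uninformative": [0.9, 0.9],    # both hypotheses predict the human accepts
    "discriminating": [0.95, 0.05], # the hypotheses disagree sharply
}
best = max(candidates, key=lambda name: expected_info_gain(prior, candidates[name]))
```

The "uninformative" datapoint yields essentially zero expected information gain (the human's response doesn't distinguish the hypotheses), so the argmax picks the "discriminating" one.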
I think using synthetic data for corrigibility can be more or less effective depending on your views on corrigibility and the type of AI we’re considering. For instance, it would be more effective under Christiano’s act-based corrigibility because we’re avoiding any unintended optimization by evaluating the agent at the behavioral level (sometimes even at thought level), but in this paradigm we’re basically explicitly avoiding general purpose search, so I expect a much higher alignment tax.
If we’re considering an agentic AI with a general purpose search module, misspecification of values is much more susceptible to Goodhart failures, because we’re applying much more optimization pressure, & it’s less likely that synthetic data on corrigibility can offer us sufficient robustness, especially when there may be systematic bias in the human filtering of synthetic data. So in this context I think a value-free core of corrigibility would be necessary to avoid the side effects that we can’t even think of.
Note that whenever I say corrigibility, I really mean instruction following, à la @Seth Herd’s comments.
Re the issue of Goodhart failures, maybe a kind of crux is how much we expect General Purpose Search to be aimable by humans. My view is that we will likely be able to get AI that is both close enough to our values and very highly aimable, because of the very large amounts of synthetic data, which means we can apply a lot of optimization pressure, because I view future AI as likely to be quite correctable, even in the superhuman regime.
Another crux might be that I think alignment probably generalizes further than capabilities, for the reasons sketched out by Beren Millidge here:
https://www.beren.io/2024-05-15-Alignment-Likely-Generalizes-Further-Than-Capabilities/
Re the issue of Goodhart failures, maybe a kind of crux is how much we expect General Purpose Search to be aimable by humans
I also expect general purpose search to be aimable; in fact, it’s selected to be aimable so that the AI can recursively retarget GPS on instrumental subgoals.
which means we can apply a lot of optimization pressure, because I view future AI as likely to be quite correctable, even in the superhuman regime.
I think there’s a fundamental tradeoff between optimization pressure & correctability: if we apply a lot of optimization pressure on the wrong goals, the AI will prevent us from correcting it, and if the goals are adequate, we won’t need to correct them. Obviously we should lean towards correctability when they’re in conflict, and I agree that the amount of optimization pressure that we can safely apply while retaining sufficient correctability can still be quite high (possibly superhuman).
Another crux might be that I think alignment probably generalizes further than capabilities, for the reasons sketched out by Beren Millidge here:
Yes, I consider this to be the central crux.
I think current models lack certain features needed for their capabilities to generalize, so observing that alignment generalizes further than capabilities for current models is only weak evidence that it will continue to be true for agentic AIs.
I also think an adequate optimization target for the physical world is much more complex than a reward model for an LLM, especially because we have to evaluate consequences in an alien ontology that might be constantly changing.
Obviously we should lean towards correctability when they’re in conflict, and I agree that the amount of optimization pressure that we can safely apply while retaining sufficient correctability can still be quite high (possibly superhuman).
This is what I was trying to say: the tradeoff, in certain applications like automating AI interpretability/alignment research, is not that harsh, and a lot of the methods that make personal intent/instruction-following AGIs feasible allow you to extract optimization that is powerful and safe enough to use iterative methods to solve the problem.
Yes, I consider this to be the central crux.
I think current models lack certain features needed for their capabilities to generalize, so observing that alignment generalizes further than capabilities for current models is only weak evidence that it will continue to be true for agentic AIs.
I also think an adequate optimization target for the physical world is much more complex than a reward model for an LLM, especially because we have to evaluate consequences in an alien ontology that might be constantly changing.
I kind of agree: at least at realistic compute levels, say through 2030, lack of search is a major bottleneck to better AI. But a few things to keep in mind:
People at OpenAI are absolutely trying to integrate search into LLMs; see this example, where they got the Q* algorithm that aced a math test:
https://www.lesswrong.com/posts/JnM3EHegiBePeKkLc/possible-openai-s-q-breakthrough-and-deepmind-s-alphago-type
Also, I don’t buy that it was refuted, based on this, which sounds like a refutation but isn’t actually a refutation, and they never directly deny it:
https://www.lesswrong.com/posts/JnM3EHegiBePeKkLc/possible-openai-s-q-breakthrough-and-deepmind-s-alphago-type#ECyqFKTFSLhDAor7k
Re today’s AIs being weak evidence that alignment generalizes further than capabilities: I think the theoretical and empirical reasons why alignment generalizes further than capabilities are in large part (but not the entire story) reducible to the fact that it’s generally much easier to verify that something has been done correctly than to actually execute the plan yourself:
2.) Reward modelling is much simpler with respect to uncertainty, at least if you want to be conservative. If you are uncertain about the reward of something, you can just assume it will be bad and generally you will do fine. This reward conservatism is often not optimal for agents who have to navigate an explore/exploit tradeoff but seems very sensible for alignment of an AGI where we really do not want to ‘explore’ too far in value space. Uncertainty for ‘capabilities’ is significantly more problematic since you have to be able to explore and guard against uncertainty in precisely the right way to actually optimize a stochastic world towards a specific desired point.
3.) There are general theoretical complexity priors to believe that judging is easier than generating. There are many theoretical results of the form that it is significantly asymptotically easier to e.g. verify a proof than generate a new one. This seems to be a fundamental feature of our reality, and this to some extent maps to the distinction between alignment and capabilities. Just intuitively, it also seems true. It is relatively easy to understand if a hypothetical situation would be good or not. It is much much harder to actually find a path to materialize that situation in the real world.
4.) We see a similar situation with humans. Almost all human problems are caused by a.) not knowing what you want and b.) being unable to actually optimize the world towards that state. Very few problems are caused by incorrectly judging or misgeneralizing bad situations as good and vice-versa. For the AI, we aim to solve part a.) as a general part of outer alignment and b.) is the general problem of capabilities. It is much much much easier for people to judge and critique outcomes than actually materialize them in practice, as evidenced by the very large amount of people who do the former compared to the latter.
5.) Similarly, understanding of values and ability to assess situations for value arises much earlier and robustly in human development than ability to actually steer outcomes. Young children are very good at knowing what they want and when things don’t go how they want, even new situations for them, and are significantly worse at actually being able to bring about their desires in the world.
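The conservatism in point 2.) of the quoted excerpt can be sketched as scoring plans by a pessimistic lower confidence bound on the reward model’s estimate instead of its mean; the plan names and numbers below are hypothetical, purely to illustrate the rule:

```python
def pessimistic_value(mean_reward, std_reward, kappa=2.0):
    """Lower-confidence-bound scoring: when the reward model is uncertain,
    assume the outcome is bad (subtract kappa standard deviations)."""
    return mean_reward - kappa * std_reward

# Candidate plans as (mean reward estimate, reward-model uncertainty); made-up numbers.
plans = {
    "familiar_plan": (0.70, 0.05),  # modest estimated reward, well understood
    "weird_plan":    (0.95, 0.40),  # higher estimated reward, far off-distribution
}
chosen = max(plans, key=lambda name: pessimistic_value(*plans[name]))
# Conservatism picks the familiar plan despite its lower mean estimate,
# exactly the "don't explore too far in value space" behavior described above.
```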
Re this:
I also think an adequate optimization target for the physical world is much more complex than a reward model for an LLM, especially because we have to evaluate consequences in an alien ontology that might be constantly changing.
I do think this means we will definitely have to get better at interpretability, but the big reason I think this matters less than you think is that I’m more optimistic about the meta-plan for alignment research, due both to my models of how research progress works and to believing that you can actually get superhuman performance at stuff like AI interpretability research and still have instruction-following AGIs/ASIs.
More concretely, I think that the adequate optimization target is actually deferrable, because we can mostly just rely on instruction following and not have to worry too much about adequate optimization targets for the physical world, since we can use the first AGIs/ASIs to do the interpretability and alignment research that helps us reveal which optimization targets to choose.
This is what I was trying to say: the tradeoff, in certain applications like automating AI interpretability/alignment research, is not that harsh, and a lot of the methods that make personal intent/instruction-following AGIs feasible allow you to extract optimization that is powerful and safe enough to use iterative methods to solve the problem.
Agreed
People at OpenAI are absolutely trying to integrate search into LLMs; see this example, where they got the Q* algorithm that aced a math test:
Interesting. I do expect GPS to be the main bottleneck for both capabilities and inner alignment.
it’s generally much easier to verify that something has been done correctly than actually executing the plan yourself
Agreed, but I think the main bottleneck is crossing the formal-informal bridge: it’s much harder to come up with a specification X such that X ⟹ alignment, but once we have such a specification it’ll be much easier to come up with an implementation (likely with the help of AI).
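The verification/generation gap under discussion can be made concrete with a toy subset-sum instance: checking a proposed certificate takes one pass, while naively producing one requires searching exponentially many subsets (the numbers below are arbitrary):

```python
from itertools import combinations

def verify(nums, indices, target):
    """Checking a proposed certificate is cheap: one pass over the chosen indices."""
    return sum(nums[i] for i in indices) == target

def generate(nums, target):
    """Finding a certificate naively means searching up to 2^n index subsets."""
    for r in range(len(nums) + 1):
        for indices in combinations(range(len(nums)), r):
            if verify(nums, indices, target):
                return indices
    return None

nums = [3, 34, 4, 12, 5, 2]
cert = generate(nums, 9)    # exponential-time search in the worst case...
ok = verify(nums, cert, 9)  # ...but checking the returned answer is immediate
```

The asymmetry is the point: `verify` is the easy "judging" direction, `generate` the hard "materializing" one.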
2.) Reward modelling is much simpler with respect to uncertainty, at least if you want to be conservative. If you are uncertain about the reward of something, you can just assume it will be bad and generally you will do fine. This reward conservatism is often not optimal for agents who have to navigate an explore/exploit tradeoff but seems very sensible for alignment of an AGI where we really do not want to ‘explore’ too far in value space. Uncertainty for ‘capabilities’ is significantly more problematic since you have to be able to explore and guard against uncertainty in precisely the right way to actually optimize a stochastic world towards a specific desired point.
Yes, I think optimizing worst-case performance is one crucial part of alignment; it’s also one advantage of infrabayesianism.
I do think this means we will definitely have to get better at interpretability, but the big reason I think this matters less than you think is that I’m more optimistic about the meta-plan for alignment research, due both to my models of how research progress works and to believing that you can actually get superhuman performance at stuff like AI interpretability research and still have instruction-following AGIs/ASIs.
Yes, I agree that accelerated/simulated reflection is a key hope for us to interpret an alien ontology, especially if we can achieve something like HCH that helps us figure out how to improve automated interpretability itself. I think this would become safer & more feasible if we have an aimable GPS and a modular world model that supports counterfactual queries (as we’d get to control the optimization target for automating interpretability without worrying about unintended optimization).
Then we’ve converged almost completely, thanks for the conversation.
Interesting. I do expect GPS to be the main bottleneck for both capabilities and inner alignment.
So you’re saying that conditional on GPS working, both capabilities and inner alignment problems are solved or solvable, right?
Agreed, but I think the main bottleneck is crossing the formal-informal bridge: it’s much harder to come up with a specification X such that X ⟹ alignment, but once we have such a specification it’ll be much easier to come up with an implementation (likely with the help of AI).
While I agree that formal proof is probably the case with the largest divide in practice, the verification/generation gap applies to a whole lot of informal fields as well, like research, engineering of buildings and bridges, and more.
I agree, though, that if we had a reliable way to cross the formal-informal bridge, it would be very helpful; I was just making a point about how pervasive the verification/generation gap is.
Yes, I think optimizing worst-case performance is one crucial part of alignment; it’s also one advantage of infrabayesianism.
My main thought on infrabayesianism is that while it’s definitely interesting, and I do like quite a bit of the math and results, right now the monotonicity principle is a big reason why I’m not that comfortable with using infrabayesianism, even if it actually worked.
I also don’t believe it’s necessary for alignment/uncertainty either.
Yes, I agree that accelerated/simulated reflection is a key hope for us to interpret an alien ontology, especially if we can achieve something like HCH that helps us figure out how to improve automated interpretability itself. I think this would become safer & more feasible if we have an aimable GPS and a modular world model that supports counterfactual queries (as we’d get to control the optimization target for automating interpretability without worrying about unintended optimization).
I wasn’t totally thinking of simulated reflection, but rather automated interpretability/alignment research.
Yeah, a big thing I admit to assuming is that GPS is quite aimable by default, due to no adversarial cognition, at least for alignment purposes. But I want to see your solution first, because I still think this research could well be useful.
Then we’ve converged almost completely, thanks for the conversation.
Thanks! I enjoyed the conversation too.
So you’re saying that conditional on GPS working, both capabilities and inner alignment problems are solved or solvable, right?
Yes, I think inner alignment is basically solved conditional on GPS working; for capabilities, I think we still need some properties of the world model in addition to GPS.
While I agree that formal proof is probably the case with the largest divide in practice, the verification/generation gap applies to a whole lot of informal fields as well, like research, engineering of buildings and bridges, and more.
I agree, though, that if we had a reliable way to cross the formal-informal bridge, it would be very helpful; I was just making a point about how pervasive the verification/generation gap is.
Agreed.
My main thought on infrabayesianism is that while it’s definitely interesting, and I do like quite a bit of the math and results, right now the monotonicity principle is a big reason why I’m not that comfortable with using infrabayesianism, even if it actually worked.
I also don’t believe it’s necessary for alignment/uncertainty either.
Yes, the monotonicity principle is also the biggest flaw of infrabayesianism IMO, & I also don’t think it’s necessary for alignment (though I think some of their results, or analogues of their results, would show up in a full solution to alignment).
I wasn’t totally thinking of simulated reflection, but rather automated interpretability/alignment research.
I intended “simulated reflection” to encompass (a form of) automated interpretability/alignment research, but I should probably use better terminology.
Yeah, a big thing I admit to assuming is that GPS is quite aimable by default, due to no adversarial cognition, at least for alignment purposes. But I want to see your solution first, because I still think this research could well be useful.
This comment is to clarify some things, not to disagree too much with you:
Yes, I think inner alignment is basically solved conditional on GPS working; for capabilities, I think we still need some properties of the world model in addition to GPS.
Then we’d better start cracking on how to get GPS into LLMs.
Re world modeling, I believe that while LLMs do have a world model in at least some areas, it isn’t all that powerful or all that reliable. IMO the meta-bottleneck on GPS/world modeling is that they were very compute-expensive back in the day, and as compute and data rise, people will start trying to put GPS/world-modeling capabilities into LLMs and will succeed far more than in the past.
And I believe that a lot of the world-modeling stuff will start to become much more reliable and powerful as a result of scale and some early GPS.
Yes, the monotonicity principle is also the biggest flaw of infrabayesianism IMO, & I also don’t think it’s necessary for alignment (though I think some of their results, or analogues of their results, would show up in a full solution to alignment).
Perhaps so, though I’d bet on synthetic data/automated interpretability being the first way we practically get a full solution to alignment.
I intended “simulated reflection” to encompass (a form of) automated interpretability/alignment research, but I should probably use better terminology.
Thanks for clarifying that, now I understand what you’re saying.