I find your text confusing. Let’s go step by step.
AlphaZero-chess has a very simple reward function: +1 for getting checkmate, −1 for opponent checkmate, 0 for draw
A trained AlphaZero-chess has extraordinarily complicated “preferences” (value function) — its judgments of which board positions are good or bad contain millions of bits of complexity
If you change the reward function (e.g. flip the sign of the reward), the resulting trained model preferences would be wildly different.
By analogy:
The human brain has a pretty simple innate reward function (let’s say dozens to hundreds of lines of pseudocode).
A human adult has extraordinarily complicated preferences (let’s say millions of bits of complexity, at least)
If you change the innate reward function in a newborn, the resulting adult would have wildly different preferences than otherwise.
Do you agree with all that?
If so, then there’s no getting around that getting the right innate reward function is extremely important, right?
So, what reward function do you propose to use for a brain-like AGI? Please, write it down. Don’t just say “the reward function is going to be learned”, unless you explain exactly how it’s going to be learned. Like, what’s the learning algorithm for that? Maybe use AlphaZero as an example for how that alleged reward-function-learning-algorithm would work? That would be helpful for me to understand you. :)
I find your text confusing. Let’s go step by step.
AlphaZero-chess has a very simple reward function: +1 for getting checkmate, −1 for opponent checkmate, 0 for draw
A trained AlphaZero-chess has extraordinarily complicated “preferences” (value function) — its judgments of which board positions are good or bad contain millions of bits of complexity
If you change the reward function (e.g. flip the sign of the reward), the resulting trained model preferences would be wildly different.
By analogy:
The human brain has a pretty simple innate reward function (let’s say dozens to hundreds of lines of pseudocode).
A human adult has extraordinarily complicated preferences (let’s say millions of bits of complexity, at least)
If you change the innate reward function in a newborn, the resulting adult would have wildly different preferences than otherwise.
I agree with this statement, because the sign change directly inverts the reward, which means what was previously rewarded is now something to avoid. But my view is that this example is probably unrepresentative, and that brains/brain-like AGI are much more robust than you think to changes in their value/reward functions (though not infinitely robust), precisely because of the very simple reward function you pointed out.
So I basically disagree that this example represents a major problem for NN/brain-like AGI robustness.
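To make the sign-flip example concrete, here is a minimal sketch (my own illustration, not code from either comment) of an AlphaZero-style terminal reward for chess and its sign-flipped counterpart; the Outcome enum is a hypothetical stand-in for a real game result.

```python
# Minimal sketch of an AlphaZero-style terminal reward for chess,
# and what "flipping the sign" of that reward means concretely.
# (Illustrative only; the Outcome type is a hypothetical stand-in.)

from enum import Enum

class Outcome(Enum):
    WIN = "win"      # we delivered checkmate
    LOSS = "loss"    # the opponent delivered checkmate
    DRAW = "draw"

def terminal_reward(outcome: Outcome) -> float:
    """+1 for our checkmate, -1 for the opponent's, 0 for a draw."""
    return {Outcome.WIN: 1.0, Outcome.LOSS: -1.0, Outcome.DRAW: 0.0}[outcome]

def flipped_reward(outcome: Outcome) -> float:
    """The sign-flipped reward: losing is now rewarded and winning punished."""
    return -terminal_reward(outcome)

# Training the same self-play algorithm against flipped_reward would yield
# a value function that prefers getting checkmated, i.e. wildly different
# learned "preferences" from one simple change to the reward function.
if __name__ == "__main__":
    for o in Outcome:
        print(o.name, terminal_reward(o), flipped_reward(o))
```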
To respond to this:
So, what reward function do you propose to use for a brain-like AGI? Please, write it down. Don’t just say “the reward function is going to be learned”, unless you explain exactly how it’s going to be learned. Like, what’s the learning algorithm for that? Maybe use AlphaZero as an example for how that alleged reward-function-learning-algorithm would work? That would be helpful for me to understand you. :)
This doesn’t actually matter for my purposes: I only need the existence of simple reward functions, as you claimed, to conclude that deceptive alignment is unlikely to happen, and I’m leaving it to the people actually working on aligning AI, like Nora Belrose, to implement this ideal.
Essentially, I’m focusing on the implications of the existence of simple algorithms for values, and pointing out that various alignment challenges either go away or become far easier if we grant that there is a simple reward function for values, which is very much a contested position on LW.
So I think we basically agree that there is a simple reward function for values, but I think this implies some other big changes in alignment that reduce the risk of AI catastrophe drastically, mostly via ruling out deceptive alignment as an outcome that will happen, along with various other side benefits I haven’t enumerated because they would make this comment too long.
Now that I’ve learned more, I have an actual model, so to answer the question below:
So, what reward function do you propose to use for a brain-like AGI? Please, write it down. Don’t just say “the reward function is going to be learned”, unless you explain exactly how it’s going to be learned. Like, what’s the learning algorithm for that? Maybe use AlphaZero as an example for how that alleged reward-function-learning-algorithm would work? That would be helpful for me to understand you. :)
To answer what algorithm exactly: it could well be the same algorithm the AI uses for its capabilities, like MCTS, AlphaZero’s algorithm, or a future AGI’s capability algorithms. But the point is that the algorithm matters less than the data, especially as the data gets larger and larger, so the really important question is how to make the dataset, and that’s answered in my comment below:
https://www.lesswrong.com/posts/83TbrDxvQwkLuiuxk/?commentId=BxNLNXhpGhxzm7heg#BxNLNXhpGhxzm7heg
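As a rough sketch of what “the algorithm matters less than the data” could look like in practice (my own illustration, not the linked comment’s proposal), the reward function below is just another small model fit by ordinary supervised learning to a curated dataset of human value judgments; the dataset, features, and dimensions are all placeholders.

```python
# Minimal sketch: the "reward function" is just another learned model,
# trained with the same generic machinery as capabilities (here, plain
# supervised learning in PyTorch). The dataset of situations and human
# value judgments is doing essentially all the work.
# Hypothetical data and dimensions; illustrative only.

import torch
import torch.nn as nn

# Curated dataset: a feature vector describing each situation, and a
# human label in [0, 1] for "how good is this by our values".
situations = torch.randn(1024, 64)      # stand-in for real features
value_labels = torch.rand(1024, 1)      # stand-in for human judgments

reward_model = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 1))
opt = torch.optim.Adam(reward_model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for step in range(200):
    pred = reward_model(situations)
    loss = loss_fn(pred, value_labels)
    opt.zero_grad()
    loss.backward()
    opt.step()

# The learned reward can now score novel situations. Change the dataset
# (the labels) and the same code yields a different learned reward;
# nothing about the optimizer or architecture has to change.
with torch.no_grad():
    print(reward_model(torch.randn(1, 64)).item())
```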
I also want to point out that, as it turns out, alignment generalizes further than capabilities, for some pretty deep reasons given below. The short answer is that verifying your values were satisfied is in many cases easier than actually executing those values out in the world, combined with values data being easier to learn than other capabilities data. The link is given below, and the key points follow:
https://www.beren.io/2024-05-15-Alignment-Likely-Generalizes-Further-Than-Capabilities/
1.) Empirically, reward models are often significantly smaller and easier to train than core model capabilities. I.e. in RLHF and general hybrid RL setups, the reward model is often a relatively simple small MLP or even linear layer stuck atop the general ‘capabilities’ model which actually does the policy optimization [a minimal sketch of this setup appears after this excerpt]. In general, it seems that simply having the reward be some simple function of the internal latent states works well. Reward models being effective as a small ‘tacked-on’ component and never being the focus of serious training effort gives some evidence towards the fact that reward modelling is ‘easy’ compared to policy optimization and training. It is also the case that, empirically, language models seem to learn fairly robust understandings of human values and can judge situations as ethical vs unethical in a way that is not obviously worse than human judgements. This is expected since human judgements of ethicality and their values are contained in the datasets they learn to approximate.
2.) Reward modelling is much simpler with respect to uncertainty, at least if you want to be conservative. If you are uncertain about the reward of something, you can just assume it will be bad and generally you will do fine. This reward conservatism is often not optimal for agents who have to navigate an explore/exploit tradeoff but seems very sensible for alignment of an AGI where we really do not want to ‘explore’ too far in value space. Uncertainty for ‘capabilities’ is significantly more problematic since you have to be able to explore and guard against uncertainty in precisely the right way to actually optimize a stochastic world towards a specific desired point.
3.) There are general theoretical complexity priors to believe that judging is easier than generating. There are many theoretical results of the form that it is significantly asymptotically easier to e.g. verify a proof than generate a new one. This seems to be a fundamental feature of our reality, and this to some extent maps to the distinction between alignment and capabilities. Just intuitively, it also seems true. It is relatively easy to understand if a hypothetical situation would be good or not. It is much much harder to actually find a path to materialize that situation in the real world.
4.) We see a similar situation with humans. Almost all human problems are caused by a.) not knowing what you want and b.) being unable to actually optimize the world towards that state. Very few problems are caused by incorrectly judging or misgeneralizing bad situations as good and vice-versa. For the AI, we aim to solve part a.) as a general part of outer alignment and b.) is the general problem of capabilities. It is much much much easier for people to judge and critique outcomes than actually materialize them in practice, as evidenced by the very large amount of people who do the former compared to the latter.
5.) Similarly, understanding of values and ability to assess situations for value arises much earlier and robustly in human development than ability to actually steer outcomes. Young children are very good at knowing what they want and when things don’t go how they want, even new situations for them, and are significantly worse at actually being able to bring about their desires in the world.
In general, it makes sense that, in some sense, specifying our values and a model to judge latent states is simpler than the ability to optimize the world. Values are relatively computationally simple and are learnt as part of a general unsupervised world model where there is ample data to learn them from (humans love to discuss values!). Values thus fall out mostly ‘for free’ from general unsupervised learning. As evidenced by the general struggles of AI agents, ability to actually optimize coherently in complex stochastic ‘real-world’ environments over long time horizons is fundamentally more difficult than simply building a detailed linguistic understanding of the world.
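As a rough illustration of points 1 and 2 above (my own sketch, not an implementation from the excerpted post), the reward model below is a small linear head on a frozen ‘capabilities’ model’s latent state, and uncertainty is handled conservatively by taking the worst case across a small ensemble of heads; the base model, sizes, and ensemble are placeholder assumptions.

```python
# Sketch of points 1 and 2 above: (1) the reward model is a cheap head
# stacked on a frozen "capabilities" model's latent states; (2) under
# uncertainty, score conservatively by taking the worst case across an
# ensemble of heads. Placeholder base model and sizes; illustrative only.

import torch
import torch.nn as nn

HIDDEN = 256

# Stand-in for a large pretrained policy/capabilities model (frozen).
base_model = nn.Sequential(nn.Linear(128, HIDDEN), nn.Tanh())
for p in base_model.parameters():
    p.requires_grad = False

# Point 1: the reward model is just a small linear layer on the latents.
reward_heads = nn.ModuleList([nn.Linear(HIDDEN, 1) for _ in range(5)])

def conservative_reward(obs: torch.Tensor) -> torch.Tensor:
    """Point 2: if the heads disagree (uncertainty), assume the worst."""
    with torch.no_grad():
        latent = base_model(obs)
    scores = torch.stack([head(latent) for head in reward_heads], dim=0)
    return scores.min(dim=0).values   # pessimistic aggregate over the ensemble

if __name__ == "__main__":
    obs = torch.randn(4, 128)
    print(conservative_reward(obs).squeeze(-1))
```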
In essence, what I’m doing here is unifying the capabilities and value reward functions, and pointing out that with total control of the dataset and densely defined rewards, we can prevent a lot of misaligned objectives from appearing, since the algorithm is less important than the data.
I think the key crux is that all, or almost all, of the differences are mediated through searching for different data, and if you had the ability to totally control a sociopath’s data sources, they’d learn a different reward function, one much closer to what you want the reward function to be.
If you had the ability to control people’s data and reward functions as thoroughly as ML practitioners control their models’ today, you could trivially brainwash them into accepting almost arbitrary facts and moralities, and it would be one of the most used technologies in politics.
But for alignment, this is awesome news, because it lets us control exactly what gets rewarded, and what the resulting values are like.