(1) You talk about the base goal, and then the training goal, and then human values/ethics. These aren’t the same thing though, right? In fact they will almost certainly be very different things. The base goal will be something like “maximize reward in the next hour or so.” Or maaaaaaybe “Do what humans watching you and rating your actions would rate highly,” though that’s a bit more complicated and would require further justification, I think. Neither of those things is anywhere near human ethics.
(2) Another nitpick which maybe is somewhat important: Sometimes you say you are arguing against deceptive alignment, other times you simplify this to “deception quite unlikely.” But your arguments aren’t against deception, they are only against deceptive alignment. If we take your arguments to their logical conclusion, we should expect our models to adopt some sort of reward-maximization as their goal, rather than human values; having done this, whether or not they are deceptive (in the minimal sense of ‘do they sometimes deliberately deceive us about important things’) depends on whether or not we sometimes reward them for lying to us, and probably we will, so QED.
(3) You say:
Gradient descent can only update the model in the direction that improves performance hyper-locally. Therefore, building the effects of future gradient updates into the decision making of the current model would have to be advantageous on the current training batch for it to emerge from gradient descent. Because each gradient update should have only a small impact on model behavior, the relatively short-term reward improvements of considering these effects should be very small. If the model isn’t being trained on goals that extend far past the next gradient update, then learning to consider how current actions affect gradient updates, which is not itself especially consequential, should be very slow.
Doesn’t this prove too much though? Doesn’t it prove that effective altruist humans are impossible, since they have goals that extend billions of years into the future even though they were created by a process (evolution) that only ever shaped them based on much more local behavior such as what happened to their genes in the next generation or three?
Another way a model might gain situational awareness is through the prompt. To give it better context for decisions, researchers will likely prompt it to understand that it is a machine learning model. However, I don’t see why a researcher would want to prompt deception-relevant situational awareness. A model could easily understand that it is a model in training without reasoning about how its gradients will affect its future goal. As discussed in the previous paragraph, gradients only have a small impact on short-term goal achievement. Therefore, unless the model has very long-term goals, it will not have a significant incentive to consider the effects of gradient updates. Similarly, researchers should have little incentive to encourage consideration of these effects.
Your definition of deception-relevant situational awareness doesn’t seem like a definition of situational awareness at all. It sounds like you are just saying the model has to be situationally aware AND ALSO care about how gradient updates affect goal attainment afterwards, i.e. be non-myopic?
In light of that, I’m confused about this paragraph where you discuss prompting.
Thanks for the thoughtful feedback both here and on my other post! I plan to respond in detail to both. For now, your comment here makes a good point about terminology, and I have replaced “deception” with “deceptive alignment” in both posts. Thanks for pointing that out!
I’m intentionally not addressing direct reward maximizers in this sequence. I think they are a much more plausible source of risk than deceptive alignment. However, I haven’t thought about them nearly as much, and I don’t yet have a strong intuition for how likely they are, so I’m choosing to stay focused on deceptive alignment for this sequence.
(1) You talk about the base goal, and then the training goal, and then human values/ethics. These aren’t the same thing though, right? In fact they will almost certainly be very different things. The base goal will be something like “maximize reward in the next hour or so.” Or maaaaaaybe “Do what humans watching you and rating your actions would rate highly,” though that’s a bit more complicated and would require further justification, I think. Neither of those things is anywhere near human ethics.
I specify the training setup here: “The goal of the training process would be a model that follows directions subject to non-consequentialist ethical considerations.”
With the level of LLM progress we already have, I think it’s time to move away from talking about this in terms of traditional RL, where you can’t give the model instructions and just have to hope that it learns from the feedback signal alone. Realistic training scenarios should include directional prompts. Do you agree?
I’m using both “base goal” and “training goal” to describe this goal. Do you have a recommendation to improve my terminology?
Doesn’t this prove too much though? Doesn’t it prove that effective altruist humans are impossible, since they have goals that extend billions of years into the future even though they were created by a process (evolution) that only ever shaped them based on much more local behavior such as what happened to their genes in the next generation or three?
Why would evolution only shape humans based on a handful of generations? The effects of genes carry on indefinitely! Wouldn’t that be more like rewarding a model based on its long-term effects? I don’t doubt that actively training a model to care about long-term goals could result in long-term goals.
I know much less about evolution than about machine learning, but I don’t think evolution is a good analogy for gradient descent. Gradient descent is often compared to local hill climbing. Wouldn’t the equivalent for evolution be more like a ton of different points on a hill, creating new points that differ in random ways and then dying in a weighted random way based on where they are on the hill? That’s a vastly more chaotic process. It also doesn’t require the improvements to be hyper-local, because of the significant randomness element. Evolution is about survival rather than direct optimization for a set of values or intelligence, so it’s not necessarily going to reach a local maximum for a specific value set. With human evolution, you also have cultural and societal evolution happening in parallel, which complicates value formation.
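To make that contrast concrete, here is a toy sketch of the two processes as I'm picturing them. Everything in it is an assumption made up for illustration: the one-dimensional `height` function, the learning rate, the mutation scale, and the softmax-style survival weights. It is not a model of real evolution or of any real training run. It only shows that gradient steps are small and deterministic, while the population process allows large random jumps between generations.

```python
import math
import random


def height(x):
    # Hypothetical one-dimensional "hill" with a single peak at x = 3.
    return -(x - 3.0) ** 2


def gradient_ascent(x, lr=0.01, steps=200):
    """Local hill climbing: every step is a small, deterministic move uphill."""
    path = [x]
    for _ in range(steps):
        grad = -2.0 * (x - 3.0)  # analytic derivative of height()
        x = x + lr * grad
        path.append(x)
    return path


def evolve(population, generations=200, mutation_scale=0.5, rng=None):
    """Population process: random mutation, then weighted-random survival."""
    rng = rng or random.Random(0)
    size = len(population)
    for _ in range(generations):
        # Each point creates two new points that differ from it in random ways.
        offspring = [
            x + rng.gauss(0.0, mutation_scale)
            for x in population
            for _ in range(2)
        ]
        # Survival is random, weighted by position on the hill. Subtracting
        # the best height keeps the exponentials from underflowing to all zeros.
        best = max(height(x) for x in offspring)
        weights = [math.exp(height(x) - best) for x in offspring]
        # Sampling is with replacement, so a lucky point can survive twice.
        population = rng.choices(offspring, weights=weights, k=size)
    return population


if __name__ == "__main__":
    path = gradient_ascent(0.0)
    largest_step = max(abs(b - a) for a, b in zip(path, path[1:]))
    print("largest single gradient step:", largest_step)

    final_population = evolve([0.0] * 50)
    print("population mean:", sum(final_population) / len(final_population))
```

In the gradient version, no single step can be larger than `lr` times the gradient at that point. In the population version, a single mutation can move a point much further than that, and which points survive is partly luck.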
As mentioned in my response to your other comment, humans seem to decide their values in a way that’s complicated, hard to predict, and not obviously in line with a process similar to gradient descent. This process should make it easier to conform to social groups in order to fit in, which seems clearly beneficial for the survival of genes. Why would gradient descent incentivize the possibility of radical value shifts like suddenly becoming longtermist?
Your definition of deception-relevant situational awareness doesn’t seem like a definition of situational awareness at all. It sounds like you are just saying the model has to be situationally aware AND ALSO care about how gradient updates affect goal attainment afterwards, i.e. be non-myopic?
Could you not have a machine learning model that has long-term goals and understands that it’s a machine learning model, but can’t or doesn’t yet reason about how its own values could update and how that would affect its goals? There’s a self-reflection element to deception-relevant situational awareness that I don’t think is implied by long-term goals. If the model has very general reasoning skills, then this might be a reasonable expectation without a specific gradient toward it. But wouldn’t it be weird to have very general reasoning skills and not already have a concept of the base goal?
I just realized I never responded to this. Sorry. I hope to find time to respond someday… feel free to badger me about it. Curious how you are doing these days and what you are up to.