Thanks for your thoughtful reply! I really appreciate it. I’m starting with your fourth point because I agree it is closest to the crux of our disagreement, and this has become a very long comment.
#4:
What amount of understanding of the base goal is sufficient? What if the answer is “It has to be quite a lot, otherwise it’s really just a proxy that appears superficially similar to the base goal?” In that case the classic arguments for deceptive alignment would work fine.
TL;DR: the model doesn’t have to explicitly represent “X`, whatever that turns out to mean”; it just has to point at its best estimate of X`, and that estimate will update over time because the model doesn’t know there’s a difference.
I propose that the relevant factor here is whether the model’s internal goal is the closest thing it has to a representation of the training goal (X`). I am assuming that models will store their goal information and decision parameters in the later layers, with world modeling happening overwhelmingly before decision-making, because it doesn’t make much sense for a model to spend time on world modeling (or anything else) after it has made its decision. I expect the proxy to be calculated from high-level concepts in the world model, not separately from the world model.
Suppose for the sake of argument that we have a model with an exceptionally long-term goal and situational awareness. However, its internal goal is its flawed representation (X) of the training goal (X`). This model can’t tell the difference between the training goal and the internal proxy goal, so it can’t be deceptively aligned yet. If it performs worse than it could have on a training example because of this, the gradients could:
Update the existing proxy goal to be closer to the training goal, or
Create a new concept for the training goal and become deceptively aligned.
Updating the existing goal will be a much simpler and more likely fix than creating an entirely new concept and switching the goal to point at that new concept. To enable deceptive alignment, and therefore better training performance, that new concept would have to represent the training goal on the current training example better than the existing concept does. I wouldn’t expect a hyper-local update to be sufficient to build such a concept and make the model deceptively aligned on its own, so the gradients have no reason to point toward creating it. On the other hand, tweaks that improve the existing understanding of the training goal would improve performance immediately.
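To make the “hyper-local” point concrete, here is a toy sketch (my own illustration, with made-up dimensions and data, not a claim about any particular training run): a single SGD step only nudges parameters that already exist by a small amount, which is why I don’t expect one update to build a new concept and re-point the goal at it.

```python
import torch

torch.manual_seed(0)
model = torch.nn.Linear(4, 1)  # stand-in for an existing proxy-goal representation
opt = torch.optim.SGD(model.parameters(), lr=1e-2)

x, target = torch.randn(8, 4), torch.randn(8, 1)
before = model.weight.detach().clone()

loss = torch.nn.functional.mse_loss(model(x), target)
loss.backward()
opt.step()  # one hyper-local update

# The change is a small nudge to weights that already exist, proportional to lr * gradient.
print((model.weight.detach() - before).abs().max())
```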
For example, let’s say we are training this model to follow directions from the prompt unless they violate ethical norms (X`). The model understands that the training goal is to follow directions subject to some ethical norms, but it does not understand those norms well enough to apply them correctly in all situations (X). Because it doesn’t understand the difference, it can’t act aligned for instrumental reasons, and it will get suboptimal reward when flaws in its understanding of the training goal affect its behavior. When it messes up in this way, the gradients should point toward improving the internal goal (X) to better match the training goal (X`). On the other hand, a hyper-local move toward a separate, more accurate concept of the training goal would be insufficient to enable deception on its own and would therefore not be learned by gradient descent.
Now consider an alternate version of the above example, where the model has the same flawed concept for the training goal, but its internal goal is maximizing paperclips (or anything else that’s represented by a separate concept from its base goal representation). This model can immediately become deceptively aligned, because optimizing for its flawed understanding of the training goal will get better training performance than optimizing for paperclips in the short run. The model can then continue learning about the training goal without systematically changing its internal goals and remain deceptively aligned.
A model that is pre-trained on approximately the whole internet should start with concepts relevant to understanding the training goal. It would be a bit weird if such a pre-trained model did not have a solid, but imperfect, understanding of following directions and of key ethical concepts. Early, short-term reward training should be able to point at those concepts and refine them. The refined concept should be the closest one the model has to the training goal, so this situation should fit my first example better than my second. That would make deceptive alignment very unlikely.
Other than direct reward optimizers, I have trouble imagining an alternative proxy concept that would be correlated enough with following directions subject to ethical considerations to still be the internal goal late enough in training for the model to have long-term goals and situational awareness. Can you think of one? Having a more realistic candidate proxy goal might make this discussion more concrete.
#1:
Minor: I might quibble a bit with your distinction between models of type 3 and models of type 4. What I don’t like is that you imply that humans tend to be mostly type 3 (with the exception, I presume, of hardcore utilitarians) and you also imply that type 3’s are chill about value drift and not particularly interested in taking over the world. Maybe I’m reading too much between the lines, but I’d say that if the AGIs we build are similar to humans in those metrics, humanity is in deep trouble.
Interesting. I think the vast majority of humans are more like satisficers than optimizers. Perhaps that describes what I’m getting at in bucket 3 better than fuzzy targets. As mentioned in the post, I think level 4 here is the most dangerous, but 3 could still result in deceptive alignment if the foundational properties developed in the order described in this post. I agree this is a minor point, and don’t think it’s central to any disagreements. See also my answer to your fifth point, which has prompted an update to my post.
#2:
I guess this isn’t an objection to your post, since deceptive alignment is (I think?) defined in such a way that this wouldn’t count, even though the model would probably be lying to the humans and pretending to be aligned when it knows it isn’t.
Yeah, I’m only talking about deceptive alignment and want to stay focused on that in this sequence. I’m not arguing against all AI x-risk.
#3:
Presumably the brain has some sort of SGD-like process for updating the synapses over time, that’s how we learn. It’s probably not exactly the same but still, couldn’t you run the same argument, and get a prediction that e.g., if we taught our children neuroscience early on and told them about this reward circuitry in their brain, they’d grow up and go to college and live the rest of their life all for the sake of pursuing reward?
We know how the gradient descent mechanism works, because we wrote the code for that.
We don’t know how the mechanism for human value learning works. The observation that human value learning doesn’t match up with how gradient descent works is evidence that gradient descent is a bad analogy for human learning, not that we misunderstand the high-level mechanism of gradient descent. If gradient descent were a good way to understand human learning, we would be able to predict changes in observed human values by reasoning about the training process and how the reward signal updates the person. But accurately predicting human behavior is much harder than that. If you try to change another person’s mind about their values, they will often openly resist your attempts and stick to their guns. Persuasion is generally difficult and not straightforward.
In a comment on my other post, you draw an analogy between gradient descent and evolution. Evolution and individual human learning are extremely different processes. How could they both be relevant analogies? For what it’s worth, I think they’re both poor analogies.
If the analogy between gradient descent and human learning were useful, I’d expect to be able to describe which characteristics of human value learning correspond to each part of the training process. For a hypothetical TAI in fine-tuning, here’s the training setup (a toy code sketch of this loop follows the list):
Training goal: following directions subject to ethical considerations.
Reward: some sort of human (or AI) feedback on the quality of outputs. Gradient descent makes updates based on this reward in a roughly deterministic way.
Prompt: the model will also have some sort of prompt describing the training goal, and pre-training will provide the necessary concepts to make use of this information.
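To illustrate what I mean, here’s a minimal toy sketch of that setup (everything here, including the model, the feedback function, and the dimensions, is a made-up stand-in, not a description of any real pipeline): the model sees a prompt, produces an output, gets a feedback score as its reward, and gradient descent deterministically updates the parameters toward higher-reward outputs.

```python
import torch

torch.manual_seed(0)
policy = torch.nn.Linear(16, 4)  # stand-in for the model being fine-tuned
opt = torch.optim.SGD(policy.parameters(), lr=1e-3)

def feedback(action: int) -> float:
    """Placeholder for human (or AI) feedback on output quality."""
    return 1.0 if action == 2 else 0.0

for step in range(100):
    prompt = torch.randn(16)  # stand-in for a prompt describing the training goal
    dist = torch.distributions.Categorical(logits=policy(prompt))
    action = dist.sample()  # stand-in for the model's response
    reward = feedback(action.item())
    loss = -dist.log_prob(action) * reward  # REINFORCE-style update toward rewarded outputs
    opt.zero_grad()
    loss.backward()
    opt.step()
```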
But I find the training setup for human value learning much more complicated and harder to describe in this way. What is the high-level training setup? What’s the training goal? What’s the reward? My impression is that when people change their minds about their values, the change is often mediated by factors like persuasive argument, personality traits, and social proof. Reward circuitry is probably involved somehow, but it seems vastly more complicated than that. Human values are also incredibly messy and poorly defined.
Even if gradient descent were a good analogy, the way we raise children is very different from how we train ML models. ML training is much more structured and carefully planned with a clear reward signal. It seems like people learn values more from observing and talking to others.
If human learning were similar to gradient descent, how would you explain that some people read about effective altruism (or any other philosophy) and quickly change their values? This seems like a very different process from gradient descent, and it’s not clear to me what the parallel to the reward signal would be in this case. To some extent, we seem to decide what our values are, and that probably makes sense for a social species from an evolutionary perspective.
It seems like this discussion would benefit if we consulted someone with expertise on human value formation.
#5:
I’m not sure I agree with your conclusion about the importance of fuzzy, complicated targets. Naively I’d expect that makes it harder, because it makes simple proxies look relatively good by comparison to the target. I think you should flesh out your argument more.
Yeah, that’s reasonable. Thanks for pointing it out. This is a holdover from an argument that I removed from my second post before publishing because I no longer endorse it. A better argument is probably about satisficing targets instead of optimizing targets, but I think this is mostly a distraction at this point. I replaced “fuzzy targets” with “non-maximization targets”.
I specify the training setup here: “The goal of the training process would be a model that follows directions subject to non-consequentialist ethical considerations.”
With the level of LLM progress we already have, I think it’s time to move away from talking about this in terms of traditional RL, where you can’t give the model instructions and just have to hope it learns from the feedback signal alone. Realistic training scenarios should include directional prompts. Do you agree?
I’m using “base goal” and “training goal” both to describe this goal. Do you have a recommendation to improve my terminology?
Why would evolution only shape humans based on a handful of generations? The effects of genes carry on indefinitely! Wouldn’t that be more like rewarding a model based on its long-term effects? I don’t doubt that actively training a model to care about long-term goals could result in long-term goals.
I know much less about evolution than about machine learning, but I don’t think evolution is a good analogy for gradient descent. Gradient descent is often compared to local hill climbing. Wouldn’t the equivalent for evolution be more like a ton of different points on a hill, each creating new points that differ in random ways and then dying off in a weighted-random way based on where they sit on the hill? That’s a vastly more chaotic process. It also doesn’t require the improvements to be hyper-local, because of the significant element of randomness. Evolution optimizes for survival rather than directly for a set of values or for intelligence, so it’s not necessarily going to reach a local maximum for a specific value set. With human evolution, you also have cultural and societal evolution happening in parallel, which complicates value formation.
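To make that contrast concrete, here’s a toy sketch (my own illustration with made-up numbers): gradient descent walks a single point uphill with deterministic, local steps, while the evolution-like process keeps a whole population, mutates it randomly, and culls it in a fitness-weighted random way.

```python
import numpy as np

rng = np.random.default_rng(0)

def fitness(x):
    return -(x - 3.0) ** 2  # a single hill with its peak at x = 3

# Gradient descent: one point, deterministic local steps along the gradient.
x = 0.0
for _ in range(100):
    grad = -2.0 * (x - 3.0)
    x += 0.05 * grad

# Evolution-like process: many points, random mutation, weighted-random survival.
pop = rng.normal(0.0, 1.0, size=50)
for _ in range(100):
    children = pop + rng.normal(0.0, 0.3, size=pop.size)
    everyone = np.concatenate([pop, children])
    scores = fitness(everyone)
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    pop = rng.choice(everyone, size=50, p=probs)  # survival weighted by fitness

print(x, pop.mean())
```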
As mentioned in my response to your other comment, humans seem to decide their values in a way that’s complicated, hard to predict, and not obviously in line with a process similar to gradient descent. This process should make it easier to conform to social groups and fit in, which seems clearly beneficial for the survival of our genes. Why would gradient descent incentivize the possibility of radical value shifts, like suddenly becoming longtermist?
Could you not have a machine learning model that has long-term goals and understands that it’s a machine learning model, but can’t or doesn’t yet reason about how its own values could update and how that would affect its goals? There’s a self-reflection element to deception-relevant situational awareness that I don’t think is implied by long-term goals. If the model has very general reasoning skills, then this might be a reasonable expectation without a specific gradient toward it. But wouldn’t it be weird to have very general reasoning skills and not already have a concept of the base goal?