OK, now a more substantive reply since I’ve gotten a chance to read more carefully. Comments as I read. Rambly:
1. Minor: I might quibble a bit with your distinction between models of type 3 and models of type 4. What I don't like is that you imply that humans tend to be mostly type 3 (with the exception, I presume, of hardcore utilitarians) and you also imply that type 3s are chill about value drift and not particularly interested in taking over the world. Maybe I'm reading too much between the lines, but I'd say that if the AGIs we build are similar to humans on those metrics, humanity is in deep trouble.
2. I like the point that if the model already has a good understanding of the base goal / base objective before it becomes goal-directed, SGD will probably just build a pointer to the base goal rather than building a pointer to a proxy and then letting instrumental convergence + deception do the rest.
In a realistic training scenario, though, the base goal will be misaligned, right? For example, in RLHF there will be biases and dogmas in the minds of the human data-providers, such that they'll often reward the model for lying or doing harmful things, and punish it for telling the truth or doing something helpful. While some of these errors will be noise, others will be systematically predictable. (And then there's the added complication of the reward model, and the fact that there's a reward counter on a GPU somewhere.) So, suppose the model has an understanding of all of these things from pre-training and then becomes agentic during fine-tuning. Won't it probably end up with a goal like "maximize this number on these GPUs," "do what makes the reward model light up the most," or "do what gets high ratings from this group of humans" (sycophancy)?
I guess this isn’t an objection to your post, since deceptive alignment is (I think?) defined in such a way that this wouldn’t count, even though the model would probably be lying to the humans and pretending to be aligned when it knows it isn’t. (I guess if it was sufficiently myopic, and the humans were smart enough to give it lots of reward when it admitted to being misaligned, then it wouldn’t lie about this. But this situation wouldn’t persist long I think.)
3. Wait a minute. Why doesn't this happen in humans? Presumably the brain has some sort of SGD-like process for updating the synapses over time; that's how we learn. It's probably not exactly the same, but still, couldn't you run the same argument and get a prediction that, e.g., if we taught our children neuroscience early on and told them about the reward circuitry in their brains, they'd grow up and go to college and live the rest of their lives all for the sake of pursuing reward? (I guess something like this does happen sometimes; plenty of people seem to have their own happiness as their only final goal, and some even seem to be fairly myopic about it. And I guess you could argue that humans become somewhat goal-directed in infancy, before they are smart enough to learn even the roughest pointer to happiness/reward/etc. But I don't think either of these responses is strong.)
4. What amount of understanding of the base goal is sufficient? What if the answer is “It has to be quite a lot, otherwise it’s really just a proxy that appears superficially similar to the base goal?” In that case the classic arguments for deceptive alignment would work fine.
I think the crux lies somewhere around here. Maybe a thing to investigate is: How complicated is the circuitry for "X, whatever that turns out to mean" compared to the circuitry for X itself? For example: Let X = "reward over the next hour or so" and X' = "Time-discounted reward with discount rate R, for [name of particular big model on particular date] as defined on page 27 of [textbook on ML]."
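To make the contrast vivid, here's a toy sketch of the two in code (the horizon, discount value, and function names are all made up for illustration):

```python
# Toy contrast between the rough concept X and the refined concept X'.
# Everything here (names, horizon, discount value) is illustrative only.

def x_rough(rewards_per_step, steps_in_an_hour=60):
    """X: just add up whatever reward arrives within roughly the next hour."""
    return sum(rewards_per_step[:steps_in_an_hour])

def x_refined(rewards_per_step, gamma=0.99):
    """X': exponentially time-discounted reward over the whole future;
    gamma stands in for the discount rate R."""
    return sum(r * gamma**t for t, r in enumerate(rewards_per_step))
```

The two roughly agree on short horizons and come apart as discounting and the long run start to matter.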
X’ is a precise, more fleshed-out and well-defined concept than X. But maybe it’s in some sense ‘what X turns out to mean.’ In other words there’s some learning process, some process of conceptual refinement, that starts with X and ends with X’. And the model more or less faithfully carries out this process. And at first when the model is young and dumb it has the concept of X but not the concept of X’, and that’s the situation it’s in when it starts to form coherent goals, and then later when it’s much smarter it’ll have morphed X into X’. And when the model is young and dumb and X is its goal, it doesn’t pursue X in ways that would get in the way of this learning/morphing process. It doesn’t want to “lock in” X in any sense.
If this turns out to be a fairly elegant/simple/straightforward way for a mind to work, then great, I think your overall story is pretty plausible. But what if getting something like this is messy and complicated? Then at some point the model forms goals, and they'll look something like X rather than X', and it'll keep X as its goal forever (and/or X will be precisified in some way that differs from the 'intended' way a human would precisify it, which amounts to the same thing: unless the model has a perfect understanding of the base objective when it first develops goals, it'll probably never have a goal that is the same as the base objective).
5. I’m not sure I agree with your conclusion about the importance of fuzzy, complicated targets. Naively I’d expect that makes it harder, because it makes simple proxies look relatively good by comparison to the target. I think you should flesh out your argument more.
Thanks for your thoughtful reply! I really appreciate it. I’m starting with your fourth point because I agree it is closest to the crux of our disagreement, and this has become a very long comment.
#4:
What amount of understanding of the base goal is sufficient? What if the answer is “It has to be quite a lot, otherwise it’s really just a proxy that appears superficially similar to the base goal?” In that case the classic arguments for deceptive alignment would work fine.
TL;DR: the model doesn't have to explicitly represent "X, whatever that turns out to mean"; it just has to point at its best estimate of X', and that estimate will update over time because the model doesn't know there's a difference.
I propose that the relevant factor here is whether the model's internal goal is the closest thing it has to a representation (X) of the training goal (X'). I am assuming that models store their goal information and decision parameters in the later layers, with world modeling happening overwhelmingly before decision-making, because it doesn't make much sense for a model to spend capacity on world modeling (or anything else) after it has made its decision. I expect the proxy to be calculated from high-level concepts in the world model, not separately from the world model.
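To make that assumption concrete, here's a toy sketch (a hypothetical architecture purely for illustration, not a claim about how any real model is wired):

```python
import torch.nn as nn

class ToyAgent(nn.Module):
    """Illustrative only: world modeling runs first, and the decision/goal
    machinery only consumes high-level concepts produced by the world model."""
    def __init__(self, d_in=128, d_concept=64, d_out=16):
        super().__init__()
        # Earlier layers: build a world-model representation (high-level concepts).
        self.world_model = nn.Sequential(nn.Linear(d_in, d_concept), nn.ReLU())
        # Later layers: the proxy goal / decision head reads those concepts;
        # it is not computed separately from the world model.
        self.decision_head = nn.Linear(d_concept, d_out)

    def forward(self, x):
        concepts = self.world_model(x)       # world modeling before the decision
        return self.decision_head(concepts)  # decision computed from the concepts
```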
Suppose for the sake of argument that we have a model with an exceptionally long-term goal and situational awareness. However, its internal goal is its flawed representation (X) of the training goal (X'). This model can't tell the difference between the training goal and the internal proxy goal, so it can't be deceptively aligned yet. If it performs worse than it could have on a training example because of this, the gradients could:
Update the existing proxy goal to be closer to the training goal, or
Create a new concept for the training goal and become deceptively aligned.
Updating the existing goal will be a much simpler and more likely fix than creating an entirely new concept and switching the goal to point at that new concept. To enable deceptive alignment, and therefore better training performance, that new concept would have to represent the training goal better on the current training example than the existing concept does. So, I wouldn't expect a hyper-local update to be sufficient to make the model deceptively aligned on its own, and the gradients therefore have no reason to point toward creating this new concept. On the other hand, tweaks that improve the existing understanding of the training goal would improve performance immediately.
For example, let's say we are training this model to follow directions from the prompt, unless they violate ethical norms (X'). The model understands that the training goal is to follow directions subject to some ethical norms but does not understand those norms well enough to apply them correctly in all situations (X). Because it doesn't understand the difference, it can't act aligned for instrumental purposes, and it will get suboptimal reward when flaws in its understanding of the training goal affect its behavior. When it messes up in this way, the gradients should point toward improving the internal goal (X) to better match the training goal (X'). On the other hand, a hyper-local move toward a separate, more accurate concept of the training goal would be insufficient to enable deception on its own and will therefore not be learned by gradient descent.
Now consider an alternate version of the above example, where the model has the same flawed concept for the training goal, but its internal goal is maximizing paperclips (or anything else that’s represented by a separate concept from its base goal representation). This model can immediately become deceptively aligned, because optimizing for its flawed understanding of the training goal will get better training performance than optimizing for paperclips in the short run. The model can then continue learning about the training goal without systematically changing its internal goals and remain deceptively aligned.
A model that is pre-trained on approximately the whole internet should start with concepts relevant to understanding the training goal. It would be a bit weird if such a pre-trained model did not have a solid, but imperfect, understanding of following directions and key ethical concepts. Early, short-term reward training should be able to point at those and refine the resulting concept. This should be the closest concept to the training goal, so it should fit better with my first example than my second. This would make deceptive alignment very unlikely.
Other than direct reward optimizers, I have trouble imagining what alternate proxy concept would be correlated strongly enough with following directions (subject to ethical considerations) to still be the internal goal late enough in the process for the model to have a long-term goal and situational awareness. Can you think of one? Having a more realistic idea of a proxy goal might make this discussion more concrete.
#1:
Minor: I might quibble a bit with your distinction between models of type 3 and models of type 4. What I don't like is that you imply that humans tend to be mostly type 3 (with the exception, I presume, of hardcore utilitarians) and you also imply that type 3s are chill about value drift and not particularly interested in taking over the world. Maybe I'm reading too much between the lines, but I'd say that if the AGIs we build are similar to humans on those metrics, humanity is in deep trouble.
Interesting. I think the vast majority of humans are more like satisficers than optimizers. Perhaps that describes what I’m getting at in bucket 3 better than fuzzy targets. As mentioned in the post, I think level 4 here is the most dangerous, but 3 could still result in deceptive alignment if the foundational properties developed in the order described in this post. I agree this is a minor point, and don’t think it’s central to any disagreements. See also my answer to your fifth point, which has prompted an update to my post.
#2:
I guess this isn’t an objection to your post, since deceptive alignment is (I think?) defined in such a way that this wouldn’t count, even though the model would probably be lying to the humans and pretending to be aligned when it knows it isn’t.
Yeah, I’m only talking about deceptive alignment and want to stay focused on that in this sequence. I’m not arguing against all AI x-risk.
#3:
Presumably the brain has some sort of SGD-like process for updating the synapses over time; that's how we learn. It's probably not exactly the same, but still, couldn't you run the same argument and get a prediction that, e.g., if we taught our children neuroscience early on and told them about the reward circuitry in their brains, they'd grow up and go to college and live the rest of their lives all for the sake of pursuing reward?
We know how the gradient descent mechanism works, because we wrote the code for that.
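To be concrete about what I mean by "we wrote the code": here is a minimal, self-contained sketch of the update rule (toy function and numbers, not a real training run):

```python
# The entire "value update" mechanism of gradient descent is a few explicit lines.
def sgd_step(params, grads, learning_rate=0.01):
    """One gradient descent step: nudge each parameter against its gradient."""
    return [p - learning_rate * g for p, g in zip(params, grads)]

# Toy usage: minimize f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w = [0.0]
for _ in range(500):
    grads = [2 * (w[0] - 3)]
    w = sgd_step(w, grads)
# After these steps, w[0] is very close to 3.
```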
We don't know how the mechanism for human value learning works. If observed human value learning doesn't match up with how gradient descent works, that's evidence that gradient descent is a bad analogy for human learning, not that we misunderstand the high-level mechanism of gradient descent. If gradient descent were a good way to understand human learning, we would be able to predict changes in observed human values by reasoning about the training process and the reward signal. But accurately predicting human behavior is much harder than that. If you try to change another person's mind about their values, they will often openly resist your attempts and stick to their guns. Persuasion is generally difficult and not straightforward.
In a comment on my other post, you draw an analogy between gradient descent and evolution. Evolution and individual human learning are extremely different processes, so how could they both be good analogies for gradient descent? For what it's worth, I think they're both poor analogies.
If the analogy between gradient descent and human learning were useful, I'd expect to be able to describe which characteristics of human value learning correspond to each part of the training process. For a hypothetical TAI in fine-tuning, here's the training setup (sketched as a toy config after the list):
Training goal: following directions subject to ethical considerations.
Reward: some sort of human (or AI) feedback on the quality of outputs. Gradient descent makes updates based on this signal in a roughly deterministic way.
Prompt: the model will also have some sort of prompt describing the training goal, and pre-training will provide the necessary concepts to make use of this information.
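Written out as a toy config (every value below is a placeholder for illustration, not a real training spec):

```python
# Hypothetical fine-tuning setup from the list above, made explicit.
training_setup = {
    "training_goal": "follow directions from the prompt, subject to ethical norms",
    "reward": "human (or AI) feedback scores on output quality",
    "update_rule": "gradient descent on that feedback, roughly deterministically",
    "prompt": "describes the training goal; pre-training supplies the concepts needed to use it",
}
```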
But I find the training setup for human value learning much more complicated and harder to describe in this way. What is the high-level training setup? What's the training goal? What's the reward? My impression is that when people change their minds about things, the change is often mediated by factors like persuasive argument, personality traits, and social proof. Reward circuitry is probably involved somehow, but it seems vastly more complicated than that. Human values are also incredibly messy and poorly defined.
Even if gradient descent were a good analogy, the way we raise children is very different from how we train ML models. ML training is much more structured and carefully planned with a clear reward signal. It seems like people learn values more from observing and talking to others.
If human learning were similar to gradient descent, how would you explain that some people read about effective altruism (or any other philosophy) and quickly change their values? This seems like a very different process from gradient descent, and it’s not clear to me what the reward signal parallel would be in this case. To some extent, we seem to decide what our values are, and that probably makes sense for a social species from an evolutionary perspective.
It seems like this discussion would benefit if we consulted someone with expertise on human value formation.
#5:
I’m not sure I agree with your conclusion about the importance of fuzzy, complicated targets. Naively I’d expect that makes it harder, because it makes simple proxies look relatively good by comparison to the target. I think you should flesh out your argument more.
Yeah, that’s reasonable. Thanks for pointing it out. This is a holdover from an argument that I removed from my second post before publishing because I no longer endorse it. A better argument is probably about satisficing targets instead of optimizing targets, but I think this is mostly a distraction at this point. I replaced “fuzzy targets” with “non-maximization targets”.