I like this post a lot and I agree that much alignment discussion is confused, treating RL agents as if they’re classical utility maximizers, where reward is the utility they’re maximizing.
In fact, they may or may not be “trying” to maximize anything at all. If they are, that’s only something that starts happening as a result of training, not from the start. And in that case, it may or may not be reward that they’re trying to maximize (if not, this is sometimes called inner alignment failure), and it’s probably not reward in future episodes (which seems to be the basis for some concerns around “situationally aware” agents acting nicely during training so they can trick us and get to act evil after training when they’re more powerful).
One caveat with the selection metaphor though: it can be misleading in its own way. Taken naively, it implies something like this: we’re selecting uniformly from all possible random initializations which would get very small loss on the training set. In fact, gradient descent will prefer points at the bottom of large attractor basins of somewhat small loss, not just points which have very small loss in isolation. This is even before taking into account the nonstationarity of the training data in a typical reinforcement learning setting, due to the sampled trajectories changing over time as the agent itself changes.
One way this distinction can matter: if two policies get equally good reward, but one is “more risky” in that a slightly less competent version of the policy gets extremely poor reward, then that one’s less likely to be selected for.
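As a rough numerical illustration of that point (a toy of my own, not anything from the post or comment): two parameter settings can reach the same peak reward, but if one sits on a very narrow peak, then slightly perturbed ("slightly less competent") versions of it do far worse, so anything that averages over such perturbations favors the wide basin.

```python
# Toy sketch: equal peak reward, but one optimum is wide and the other is a narrow spike.
# Averaging reward over small parameter noise (a crude stand-in for the jitter of training)
# strongly favors the wide basin. All numbers here are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)

def safe_reward(theta):
    # wide, forgiving optimum centered at theta = +1 (peak reward 1.0)
    return np.exp(-(theta - 1.0) ** 2 / (2 * 1.0 ** 2))

def risky_reward(theta):
    # equally high but very narrow optimum centered at theta = -1 (peak reward 1.0)
    return np.exp(-(theta + 1.0) ** 2 / (2 * 0.05 ** 2))

noise = 0.2  # scale of the parameter perturbation ("slightly less competent" versions)
perturb = noise * rng.standard_normal(100_000)

print("reward at the exact optimum:  safe =", safe_reward(1.0), " risky =", risky_reward(-1.0))
print("mean reward under perturbation:"
      f"  safe = {safe_reward(1.0 + perturb).mean():.3f}"
      f"  risky = {risky_reward(-1.0 + perturb).mean():.3f}")
```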
This might actually suggest a strategy for training out deception: do it early and intensely, before the model becomes competent at it, punishing detectable deception (when e.g. interpretability tools can reveal it) much more than honest mistakes, with the hope of knocking the model out of any attractor basin for very deceptive behavior early on, when we can clearly see it, rather than later on, when its deceptions have gotten good enough that we have trouble detecting them. (This assumes that there is an “honesty” attractor basin, i.e. that low-competence versions of honesty generalize naturally, remaining honest as models become more competent. If not, then this fact might itself be apparent for multiple increments of competence prior to the model getting good enough to frequently trick us, or even being situationally aware enough that it acts as if it were honest because it knows it’s not good enough to trick us.)
More generally, this is suggestive of the idea: to the extent possible, train values before training competence. This in turn implies that it’s a mistake to only fine-tune fully pre-trained language models on human feedback, because by then they already have concepts like “obvious lie” vs. “nonobvious lie”, and fine-tuning may just push them from preferring the first to the second. Instead, some fine-tuning should happen as early as possible.
[ETA: Just want to clarify that the last two paragraphs are pretty speculative and possibly wrong or overstated! I was mostly thinking out loud. Definitely would like to hear good critiques of this.
Also changed a few words around for clarity.]
More generally, this is suggestive of the idea: to the extent possible, train values before training competence. This in turn implies that it’s a mistake to fine-tune already pre-trained language models on human feedback, because by then they already have concepts like “obvious lie” vs. “nonobvious lie”, and fine-tuning may just push them from preferring the first to the second. Instead, some fine-tuning should happen as early as possible.
(I think this is a really intriguing hypothesis; strong-upvote)
Taken naively, it implies something like this: we’re selecting uniformly from all possible random initializations which would get very small loss on the training set.
This is also true of evolutionary selection mechanisms, and I think that metaphor is quite apt.
I agree the evolutionary metaphor works in this regard, because of the repeated interplay between small variations and selection.
The caution is against only thinking about the selection part: thinking of gradient descent as just a procedure that, when it’s done, gives you a low-loss model from the space of possible models.
In particular, there’s this section in the post:
Consider a hypothetical model that chooses actions by optimizing towards some internal goal which is highly correlated with the reward that would be assigned by a human overseer.
Obviously, RL is going to exhibit selection pressure towards such a model.
It is not obvious to me that RL will exhibit selection pressure towards such a model! That depends on what models are nearby in parameter space. That model may have very high reward, but the models nearby could have low reward, in which case there’s no path to it.
So RL is similar to evolutionary selection in the sense that after each iteration there is a reachable space, and the space only narrows (never widens) with each iteration.
E.g. fish could evolve into humans and into orcas, but orcas cannot evolve into humans?
(I don’t think this analogy actually works very well.)
Analogy seems okay by me, because I don’t think “the space only narrows (never widens) with each iteration” is true about RL or about evolutionary selection!
Oh, do please explain.
Wait, why would it only narrow in either case?
Because investments close off parts of solution space?
I guess I’m imagining something like a tree. Nodes can reach all their descendants, but a node cannot reach any of its siblings’ descendants. As you move deeper into the tree, the set of reachable nodes becomes strictly smaller.
What does that correspond to?
Like, I think that the solution space in both cases is effectively unbounded and traversable in any direction, with only a tiny number of solutions that have ever been instantiated at any given point (in evolutionary history/in the training process), and at each iteration there are tons of “particles” (genomes/circuits) trying out new configurations. Plus if you account for the fact that the configuration space can get bigger over time (genomes can grow longer/agents can accumulate experiences) then I think you can really just keep on finding new configurations ’til the cows come home. Yes, the likelihood of ever instantiating the same one twice is tiny, but instantiating the same trait/behavior twice? Happens all the time, even within the same lineage. Looks like in biology, there’s even a name for it!
If there’s a gene in the population and a totally new mutation arises, now you have both the original and the mutated version floating somewhere in the population, which slightly expands the space of explored genomes (err, “slightly” relative to the exponentially-big space of all possible genomes). Even if that mutated version takes over because it increases fitness in a niche this century, that niche could easily change next century, and there’s so much mutation going on that I don’t see why the original variant couldn’t arise again. Come to think of it, the constant changeover of environmental circumstances in evolution kinda reminds me of nonstationarity in RL...
The issue with early fine-tuning is that there’s not much that humans can actually select on, because the models aren’t capable enough; it’s really hard for me to say that one string of gibberish is better or worse than another.
That’s why I say as early as possible, and not right from the very start.
Seems tangentially related to the “train a sequence of reporters” strategy for ELK. They don’t phrase it in terms of basins and path dependence, but those are a great frame to look at it with.
Personally, I think supervised learning has low path-dependence because of exact gradients plus always being able to find a direction to escape basins in high dimensions, while reinforcement learning has high path-dependence because updates influence future training data, causing attractors/equilibria (I’m more uncertain about the latter, but that’s my feeling).
So the really out-there take: we want to give the LLM influence over its future training data in order to increase path-dependence and get the attractors we want ;)
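A toy sketch of the path-dependence claim above (my own, with made-up numbers rather than anything from the comment): a greedy bandit learner only collects data on the arm it currently prefers, so its own updates shape its future training data and an unlucky early draw can lock it onto the worse arm; a learner given a fixed, evenly split dataset has no such feedback loop and essentially always recovers the better arm from the same sample budget.

```python
# Greedy two-armed bandit (data collection depends on current beliefs) vs. a "supervised"
# analogue trained on a fixed, evenly split dataset. Illustrative only; numbers are made up.
import numpy as np

rng = np.random.default_rng(0)
p = [0.7, 0.5]  # true success probabilities; arm 0 is genuinely better

def greedy_bandit(steps=500):
    counts = np.array([1.0, 1.0])
    totals = np.array([float(rng.random() < p[0]),   # one initial pull of each arm,
                       float(rng.random() < p[1])])  # then purely greedy afterwards
    for _ in range(steps):
        arm = int(np.argmax(totals / counts))        # future data depends on current estimates
        totals[arm] += rng.random() < p[arm]
        counts[arm] += 1
    return int(np.argmax(totals / counts))

def fixed_data_learner(steps=500):
    # same sample budget, but split evenly regardless of the learner's current beliefs
    means = [np.mean(rng.random(steps // 2) < p[arm]) for arm in range(2)]
    return int(np.argmax(means))

runs = 1000
print("greedy bandit ends up preferring the worse arm:",
      sum(greedy_bandit() == 1 for _ in range(runs)), "/", runs)
print("fixed-data learner ends up preferring the worse arm:",
      sum(fixed_data_learner() == 1 for _ in range(runs)), "/", runs)
```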
I think a more sinister problem with ML, and especially with alignment, is linguistic abstraction. This post is a good example: the author is treating reinforcement learning as if it meant what the words “reinforcement learning” mean in lay English, i.e. (1) reinforcement (rewards) plus (2) machine learning. You are taking the name of an ML algorithm too literally. Let me show you:
However, if at test-time you move the coin so it is now on the left-hand side of the level, the agent will not navigate to the coin, but instead continue navigating to the right-hand side of the level.
This is just over-fitting.
if two policies get equally good reward, but one is “more risky” in that a slightly less competent version of the policy gets extremely poor reward, then that one’s less likely to be selected for.
This is just over-fitting too.
The same thing happens with relating neural networks to actual neuroscience. It started out with neuroscience inspiring ML, but now, because ML with neural networks is so successful, it’s inspiring neuroscience as well. It seems like we are mentally stuck in established models. LeCun’s recent paper on AGI is based on human cognition too. We are so obsessed with the word “intelligence” these days that it feels more like a constraint than an inspiring perspective; you could just as well generalize AI and ML as statistical computation. I think the alignment problem mostly has to do with how we are using ML systems (i.e. what domains we are deploying them in), rather than with the systems themselves. Whether a system is inspired by the human brain or something else, at the end of the day it’s just doing statistical computation. It’s really what you do with the computed results that has the further implications alignment is mostly concerned about.
A model without a prior is the uniform distribution; it is the least over-fitted model you can possibly have. Then you go through the learning process, over-fitting and under-fitting multiple times, to get a more accurate model. It will never be perfect because the data will never be perfect. If your training data consists of papers from before 2010, then you may well be over-fitting if you use the same model to test on papers from after 2010.
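A minimal numeric sketch of that last point (my own toy example, not the commenter’s): fit a simple and a very flexible model on inputs from one range (“papers before 2010”) and evaluate on a later range (“papers after 2010”); the flexible model fits the training range better but falls apart once the distribution shifts.

```python
# Over-fitting under distribution shift: polynomial fits to noisy linear data.
# All numbers and the "before/after 2010" framing are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
trend = lambda x: 1.5 * x + 0.5                     # the underlying relationship

x_train = rng.uniform(0.0, 1.0, 20)                 # "before 2010"
y_train = trend(x_train) + 0.1 * rng.standard_normal(20)
x_test = rng.uniform(1.0, 2.0, 200)                 # "after 2010": same trend, shifted inputs
y_test = trend(x_test)

for degree in (1, 8):
    coeffs = np.polyfit(x_train, y_train, degree)   # least-squares polynomial fit
    train_mse = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree}: train MSE = {train_mse:.3g}, shifted-test MSE = {test_mse:.3g}")
```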