Deliberately create a (very obvious[2]) inner optimizer, whose inner loss function includes no mention of human values / objectives.[3]
Grant that inner optimizer ~billions of times greater optimization power than the outer optimizer.[4]
Let the inner optimizer run freely without any supervision, limits or interventions from the outer optimizer.[5]
I think that the conditions for an SLT to arise are weaker than you describe.
For (1), it’s unclear to me why you think you need to have this multi-level inner structure.[1] If instead of reward circuitry inducing human values, evolution directly selected over policies, I’d expect similar inner alignment failures. It’s also not necessary that the inner values of the agent make no mention of human values / objectives; it needs both to a) value them enough to not take over, and b) maintain these values post-reflection.
For (2), it seems like you are conflating ‘amount of real world time’ with ‘amount of consequences-optimization’. SGD is just a much less efficient optimizer than intelligent cognition—in-context learning happens much faster than SGD learning. When the inner optimizer starts learning and accumulating knowledge, it seems totally plausible to me that this will happen on much faster timescales than the outer selection.
For (3), I don’t think that the SLT requires the inner optimizer to run freely, it only requires one of:
a. the inner optimizer running much faster than the outer optimizer, such that the updates don’t occur in time.
b. the inner optimizer doing gradient hacking / exploration hacking, such that the outer loss’s updates are ineffective.
Evolution, of course, does have this structure, with 2 levels of selection; it just doesn’t seem like this is a relevant property for thinking about the SLT.
I’m guessing you misunderstood what I meant when I referred to “the human learning process” as the thing that was a ~1 billion X stronger optimizer than evolution and responsible for the human SLT. I wasn’t referring to human intelligence or what we might call human “in-context learning”. I was referring to the human brain’s update rules / optimizer: i.e., whatever quasi-Hebbian process the brain uses to minimize sensory prediction error, maximize reward, and whatever else factors into the human “base objective”. I was not referring to the intelligences that the human base optimizers build over a lifetime.
If instead of reward circuitry inducing human values, evolution directly selected over policies, I’d expect similar inner alignment failures.
I very strongly disagree with this. “Evolution directly selecting over policies” in an ML context would be equivalent to iterated random search, which is essentially a zeroth-order approximation to gradient descent. Under certain simplifying assumptions, they are actually equivalent. It’s the loss landscape and parameter-function map that are responsible for most of a learning process’s inductive biases (especially for large amounts of data). See: Loss Landscapes are All You Need: Neural Network Generalization Can Be Explained Without the Implicit Bias of Gradient Descent.
Most of the difference in outcomes between human biological evolution and DL comes down to the fact that bio evolution has a wildly different mapping from parameters to functional behaviors, as compared to DL. E.g.,
Bio evolution’s parameters are the genome, which mostly configures the learning proclivities and reward circuitry of the human within lifetime learning process, whereas DL parameters are the network’s actual weights, which are much more able to directly specify particular behaviors.
The “functional output” of human bio evolution isn’t actually the behaviors of individual humans. Rather, it’s the tendency of newborn humans to learn behaviors in a given environment. It’s not like in DL, where you can train a model, then test that same model in a new environment. Rather, optimization over the human genome in the ancestral environment produced our genome, and now a fresh batch of humans arise and learn behaviors in the modern environment.
Point 2 is the distinction I was referencing when I said:
“human behavior in the ancestral environment” versus “human behavior in the modern environment” isn’t a valid example of behavioral differences between training and deployment environments.
Overall, bio evolution is an incredibly weird optimization process, with specific quirks that predictably cause very different outcomes as compared to either DL or human within lifetime learning. As a result, bio evolution’s outcomes have very little bearing on DL. It’s deeply wrong to lump them all under the same “hill climbing paradigm” and assume they’ll all have the same dynamics.
It’s also not necessary that the inner values of the agent make no mention of human values / objectives; it needs both to a) value them enough to not take over, and b) maintain these values post-reflection.
This ties into the misunderstanding I think you made. When I said:
Deliberately create a (very obvious[2]) inner optimizer, whose inner loss function includes no mention of human values / objectives.[3]
The “inner loss function” I’m talking about here is not human values, but instead whatever mix of predictive loss, reward maximization, etc., that forms the effective optimization criterion for the brain’s “base” distributed quasi-Hebbian/whatever optimization process. Such an “inner loss function” in the context of contemporary AI systems would not refer to the “inner values” that arise as a consequence of running SGD over a bunch of training data. It would be something much, much weirder and very different from current practice.
E.g., if we had a meta-learning setup where the top-level optimizer automatically searches for a reward function F, which, when used in another AI’s training, will lead to high scores on some other criterion C, via the following process:
Randomly initialize a population of models.
Train them with the current reward function F.
Evaluate those models on C.
Update the reward function F to be better at training models to score highly on C.
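A toy sketch of that loop, shrunk down so that a “model” is just a scalar policy, C scores closeness to a true target, and F is parameterized by a single proxy target that the outer loop searches over (the whole setup here is illustrative, not from the post):

```python
import random

random.seed(0)

TARGET = 3.0  # the true criterion C rewards policies near this value


def criterion_C(policy):
    # Outer criterion C: negative distance to the true target.
    return -abs(policy - TARGET)


def train_with_F(proxy_target, steps=200):
    # Inner training: hill-climb the policy on the *proxy* reward
    # F(policy) = -(policy - proxy_target)^2, never consulting C.
    policy = random.uniform(-10, 10)
    for _ in range(steps):
        candidate = policy + random.gauss(0, 0.5)
        if -(candidate - proxy_target) ** 2 > -(policy - proxy_target) ** 2:
            policy = candidate
    return policy


def meta_search(outer_steps=50, population_size=5):
    # Outer loop: search over F (here, just its proxy target) so that
    # inner training tends to produce models scoring highly on C.
    best_proxy, best_score = None, float("-inf")
    for _ in range(outer_steps):
        proxy = random.uniform(-10, 10)
        # 1. Randomly initialize a population; 2. train each with F;
        population = [train_with_F(proxy) for _ in range(population_size)]
        # 3. evaluate on C; 4. keep the F whose trainees score highest.
        score = sum(criterion_C(p) for p in population) / population_size
        if score > best_score:
            best_proxy, best_score = proxy, score
    return best_proxy


best_proxy = meta_search()  # ends up near TARGET, since F is searched to serve C
```

In this toy, the F that gets selected points straight at C’s target; the worry in the text is exactly the other case, where F only rewards shallow correlates of C, so a fresh environment decouples the two.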
The “inner loss function” I was talking about in the post would be most closely related to F. And what I mean by “Deliberately create a (very obvious[2]) inner optimizer, whose inner loss function includes no mention of human values / objectives”, in the context of the above meta-learning setup, is to point to the relationship between F and C.
Specifically, does F actually reward the AIs for doing well on C? Or, as with humans, does F only reward the AIs for achieving shallow environmental correlates of scoring well on C? If the latter, then you should obviously consider that, if you create a new batch of AIs in a fresh environment, and train them on an unmodified reward function F, that the things F rewards will become decoupled from the AIs eventually doing well on C.
Returning to humans:
Inclusive genetic fitness is incredibly difficult to “directly” train an organism to maximize. Firstly, IGF can’t actually be measured in an organism’s lifetime, only estimated based on the observable states of the organism’s descendants. Secondly, “IGF estimated from observing descendants” makes for a very difficult reward signal to learn on because it’s so extremely sparse, and because the within-lifetime actions that lead to having more descendants are often very far in time away from being able to actually observe those descendants. Thus, any scheme like “look at descendants, estimate IGF, apply reward proportional to estimated IGF” would completely fail at steering an organism’s within lifetime learning towards IGF-increasing actions.
Evolution, being faced with the standard RL issues of reward sparseness and long time horizons, adopted a standard RL solution to those issues, namely reward shaping. E.g., rather than rewarding organisms for producing offspring, it builds reward circuitry that rewards organisms for precursors to having offspring, such as having sex, which allows rewards to be more frequent and closer in time to the behaviors they’re supposed to reinforce.
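To make the sparse-vs-shaped contrast concrete, here is a hypothetical toy chain environment (the setup and numbers are mine, purely illustrative): a “direct” reward fires only on the final outcome, while a shaped reward credits each precursor step.

```python
GOAL = 10  # the outcome we "really" care about (offspring / measured IGF)


def sparse_reward(old_pos, new_pos):
    # Direct reward: fires only when the final outcome is reached,
    # so almost every action gets zero feedback.
    return 1.0 if new_pos == GOAL else 0.0


def shaped_reward(old_pos, new_pos):
    # Shaped reward: credits precursors (any step toward the goal),
    # so feedback is frequent and close in time to the behavior.
    return 0.1 if abs(GOAL - new_pos) < abs(GOAL - old_pos) else 0.0


# Walk from 0 to 10 and count how often each reward gives any signal.
trajectory = list(range(GOAL + 1))  # positions 0, 1, ..., 10
steps = list(zip(trajectory, trajectory[1:]))
sparse_signals = sum(1 for a, b in steps if sparse_reward(a, b) > 0)
shaped_signals = sum(1 for a, b in steps if shaped_reward(a, b) > 0)
# sparse: one signal on the whole trajectory; shaped: a signal every step
```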
In fact, evolution relies so heavily on reward shaping that I think there’s probably nothing in the human reward system that directly rewards increased IGF, at least not in the direct manner an ML researcher could achieve by running a self-replicating model a bunch of times in different environments, measuring the resulting “IGF” of each run, and directly rewarding the model in proportion to its “IGF”.
This is the thing I was actually referring to when I mentioned “inner optimizer, whose inner loss function includes no mention of human values / objectives.”: the human loss / reward functions not directly including IGF in the human “base objective”.
(Note that we won’t run into similar issues with AI reward functions vs human values. This is partially because we have much more flexibility in what we include in a reward function as compared to evolution (e.g., we could directly train an AI on estimated IGF). Mostly though, it’s because the thing we want to align our models to, human values, have already been selected to be the sorts of things that can be formed via RL on shaped reward functions, because that’s how they actually arose at all.)
For (2), it seems like you are conflating ‘amount of real world time’ with ‘amount of consequences-optimization’. SGD is just a much less efficient optimizer than intelligent cognition
Again, the thing I’m pointing to as the source of the human-evolutionary sharp left turn isn’t human intelligence. It’s a change in the structure of how optimization power (coming from the “base objective” of the human brain’s updating process) was able to contribute to capabilities gains over time. If human evolution were an ML experiment, the key change I’m pointing to isn’t “the models got smart”. It’s “the experiment stopped being quite as stupidly wasteful of compute” (which happened because the models got smart enough to exploit a side-channel in the experiment’s design that allowed them to pass increasing amounts of information to future generations, rather than constantly being reset to the same level each time). Then, the reason this won’t happen in AI development is that there isn’t a similarly massive overhang of completely misused optimization power / compute, which could be unleashed via a single small change to the training process.
in-context learning happens much faster than SGD learning.
Is it really? I think they’re overall comparable ‘within an OOM’, just useful for different things. It’s just much easier to prompt a model and immediately see how this changes its behavior, but on head-to-head comparisons, it’s not at all clear that prompting wins out. E.g., Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning
In particular, I think prompting tends to be more specialized to getting good performance in situations similar to those the model has seen previously, whereas training (with appropriate data) is more general in the directions in which it can move capabilities. Extreme example: pure language models can be few-shot prompted to do image classification, but are very bad at it. However, they can be directly trained into capable multi-modal models.
I think this difference between in-context vs SGD learning makes it unlikely that in-context learning alone will suffice for an explosion in general intelligence. If you’re only sampling from the probability distribution created by a training process, then you can’t update that distribution, which I expect will greatly limit your ability to robustly generalize to new domains, as compared to a process where you gather new data from those domains and update the underlying distribution with those data.
For (3), I don’t think that the SLT requires the inner optimizer to run freely, it only requires one of:
a. the inner optimizer running much faster than the outer optimizer, such that the updates don’t occur in time.
b. the inner optimizer doing gradient hacking / exploration hacking, such that the outer loss’s updates are ineffective.
(3) is mostly there to point to the fact that evolution took no corrective action whatsoever in regards to humans. Evolution can’t watch humans’ within lifetime behavior, see that they’re deviating from the “intended” behavior, and intervene in their within lifetime learning processes to correct such issues.
Human “inner learners” take ~billions of inner steps for each outer evolutionary step. In contrast, we can just assign whatever ratio of supervisory steps to runtime execution steps, and intervene whenever we want.
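A minimal sketch of that asymmetry, using a hypothetical training loop in which the supervision interval is just a parameter we pick (evolution is stuck with an astronomically large one; we are not):

```python
def run_training(inner_steps, supervise_every, inner_step, outer_check):
    # Interleave inner-optimizer steps with outer supervisory checks.
    # Evolution's effective `supervise_every` is roughly an organism's
    # lifetime of inner steps; an ML setup can make it as small as it likes.
    interventions = 0
    for t in range(inner_steps):
        inner_step(t)  # the inner learner acts / updates
        if t % supervise_every == 0:
            outer_check(t)  # the outer process inspects and can correct
            interventions += 1
    return interventions


# Supervise every 10 inner steps out of 1000: 100 chances to intervene.
n = run_training(1000, 10, inner_step=lambda t: None, outer_check=lambda t: None)
```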
Thanks for the response!

I think I understand these points, and I don’t see how this contradicts what I’m saying. I’ll try rewording.
Consider the following Gaussian process:
Each blue line represents a possible fit of the training data (the red points), and so which one of these is selected by a learning process is a question of inductive bias. I don’t have a formalization, but I claim: if your data-distribution is sufficiently complicated, by default, OOD generalization will be poor.
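A figure like this can be reproduced with a standard GP posterior (this sketch assumes an RBF kernel and a few arbitrary “red points”; the details are mine, not the comment’s): every sample fits the data, yet far from the data the samples disagree wildly, so which generalization you get is pure inductive bias.

```python
import numpy as np

rng = np.random.default_rng(0)


def rbf(a, b, length=1.0):
    # Squared-exponential kernel between two 1-D point sets.
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / length**2)


x_train = np.array([-2.0, -1.0, 0.0, 1.0])  # the "red points"
y_train = np.sin(x_train)
x_test = np.linspace(-3, 6, 50)  # extends far past the data

# Standard GP posterior with a small noise jitter.
K = rbf(x_train, x_train) + 1e-6 * np.eye(len(x_train))
K_s = rbf(x_test, x_train)
mean = K_s @ np.linalg.solve(K, y_train)
cov = rbf(x_test, x_test) - K_s @ np.linalg.solve(K, K_s.T)

# Each posterior sample is one "blue line": a function consistent with
# the red points, selected only by the prior's inductive bias.
samples = rng.multivariate_normal(mean, cov + 1e-8 * np.eye(len(x_test)), size=5)

std = np.sqrt(np.clip(np.diag(cov), 0.0, None))
# std is tiny near the training points and ~1 (the prior scale) far OOD.
```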
Now, you might ask, how is this consistent with capabilities generalizing? I note that they haven’t generalized all that well so far, but once they do, it will be because the learned algorithm has found exploitable patterns in the world and methods of reasoning that generalize far OOD.
You’ve argued that there are different parameter-function maps, so evolution and NNs will generalize differently. This is of course true, but I think it’s beside the point. My claim is that doing selection over a dataset with sufficiently many proxies that fail OOD, without a particularly benign inductive bias, leads (with high probability) to the selection of a function that fails OOD. Since most generalizations are bad, we should expect bad behavior from NNs as well as from evolution. I continue to think evolution is valid evidence for this claim, and the specific inductive bias isn’t load-bearing on this point—the related load-bearing assumption is the lack of an inductive bias that is benign.
If we had reasons to think that NNs were particularly benign, and that once NNs became sufficiently capable their alignment would also generalize correctly, then you could argue that we don’t have to worry about this. But as yet, I don’t see a reason to think that an NN parameter-function map is more likely to lead to inductive biases that pick a good generalization by default than any other set of inductive biases.
It feels to me as if your argument is that we understand neither evolution nor NN inductive biases, and so we can’t make strong predictions about OOD generalization, so we are left with our high uncertainty prior over all of the possible proxies that we could find. It seems to me that we are far from being able to argue things like “because of inductive bias from the NN architecture, we’ll get non-deceptive AIs, even if there is a deceptive basin in the loss landscape that could get higher reward.”
I suspect you think bad misgeneralization happens only when you have a two-layer selection process (and this is especially sharp when there’s a large time disparity between these processes), like evolution setting up the human within lifetime learning. I don’t see why you think that these types of processes would be more likely to misgeneralize.
(only responding to the first part of your comment now, may add on additional content later)