The “maximize all the variables” tendency in reasoning about AGI.
Here are some lines of thought I perceive, which are probably straw to varying extents for some people and real to varying extents for other people. I give varying responses to each, but the point isn’t the truth value of any given statement so much as the pattern across the statements:
If an AGI has a concept around diamonds, and is motivated in some way to make diamonds, it will make diamonds which maximally activate its diamond-concept circuitry (possible example).
My response.
An AI will be trained to minimal loss on the training distribution.
SGD does not reliably find minimum-loss configurations (modulo expressivity), in practice, in cases we care about. The existence of knowledge distillation is one large counterexample.
Quintin: “In terms of results about model distillation, you could look at appendix G.2 of the Gopher paper. They compare training a 1.4 billion parameter model directly, versus distilling a 1.4 B model from a 7.1 B model.”
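To make the distillation point concrete, here is a minimal sketch of a standard knowledge-distillation loss (my own illustrative PyTorch, not the Gopher appendix G.2 setup; the shapes and temperature are placeholders). The relevance: the distilled student and a directly-trained model of the same size have the same expressivity, so if the distilled one reaches lower loss, direct SGD training evidently did not find that architecture’s minimum-loss configuration.

```python
# Illustrative sketch of knowledge distillation (placeholder sizes and hyperparameters).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between the teacher's and student's softened output distributions."""
    t = temperature
    soft_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_soft_student = F.log_softmax(student_logits / t, dim=-1)
    # The t^2 factor is the usual scaling that keeps gradient magnitudes
    # comparable across temperatures.
    return F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * (t * t)

# Toy usage with random logits (batch of 4, vocabulary of 10):
student_logits = torch.randn(4, 10, requires_grad=True)
teacher_logits = torch.randn(4, 10)
loss = distillation_loss(student_logits, teacher_logits)
loss.backward()  # gradients flow into the student only; the teacher is fixed
```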
Predictive processing means that the goal of the human learning process is to minimize predictive loss.[1]
In a process where local modifications are applied to reduce some locally computed error (e.g. “subconsciously predicting the presence of a visual-field edge where there definitely was not an edge”), it’s just an enormous leap to go “and then the system minimizes this predictive error.”
I think this is echoing the class of mistake I critique in Reward!=Optimization Target: “there’s a local update on recent data d to increase metric f(d)” → “now I can use goal-laden language to describe the global properties of this process.” I think this kind of language (e.g. “the goal of the human learning process”, “minimize”) should make you sit bolt upright in alarm and indignation. Minimize in a local neighborhood at the given timestep, maybe.
Even if this claim (3) were true due to some kind of convergence result, that would be derived from a detailed analysis of learning dynamics. Not a situation where you can just look at an update rule and go “yeah kinda looks like it’s trying to globally minimize predictive loss, guess I can just start calling it that now.”
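As a purely illustrative toy (nothing specific to brains or predictive processing; the function and hyperparameters are arbitrary choices of mine): even for plain gradient descent, “each step locally reduces the error” does not imply “the process minimizes the error.” Here the iterate settles into a local minimum of a simple quartic and never finds the global one.

```python
# Toy demonstration: a local error-reducing update rule need not globally minimize the error.
# f has a local minimum near x ≈ 1.13 and a lower, global minimum near x ≈ -1.30.
def f(x):
    return x**4 - 3 * x**2 + x

def grad_f(x):
    return 4 * x**3 - 6 * x + 1

x = 2.0  # initialization
for _ in range(10_000):
    x -= 0.01 * grad_f(x)  # each step reduces f in a neighborhood of the current x

print(f"converged to x = {x:.2f}, f(x) = {f(x):.2f}")  # x ≈ 1.13, f ≈ -1.07 (local minimum)
print(f"global minimum: f(-1.30) = {f(-1.30):.2f}")     # ≈ -3.51
```

Whether any analogous convergence-to-the-global-minimum claim holds for the brain’s updates is exactly what would require the “detailed analysis of learning dynamics” mentioned above.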
Policy networks are selected for getting as much reward as possible on training, so they’ll do that.
Response: “From my perspective, a bunch of selection-based reasoning draws enormous conclusions (e.g. high chance the policy cares about reward OOD) given vague / weak technical preconditions (e.g. policies are selected for reward) without attending to strength or mechanism or size of the trained network or training timescales or net direction of selection.”
SOTA agents often don’t get max reward on training.
See more of my comments in this thread for actual detailed responses.
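As a toy illustration of the gap between “selected for reward” and “gets max reward” (a made-up bandit of mine, not any SOTA setup): a softmax policy trained with REINFORCE is pushed toward higher reward at every step, yet after a finite training budget it still puts probability on the worse arm, so its average training reward sits below the maximum achievable.

```python
# Toy REINFORCE on a two-armed bandit: the policy is updated toward higher reward at
# every step, but after finite training it is still stochastic and its expected
# training reward is below the best arm's mean of 0.7.
import numpy as np

rng = np.random.default_rng(0)
arm_means = np.array([0.3, 0.7])  # arm 1 is better
theta = np.zeros(2)               # softmax policy logits
lr = 0.05

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

for _ in range(1_000):
    probs = softmax(theta)
    a = rng.choice(2, p=probs)
    r = rng.normal(arm_means[a], 0.1)
    grad_log = -probs
    grad_log[a] += 1.0            # gradient of log pi(a) for a softmax policy
    theta += lr * r * grad_log    # REINFORCE update (no baseline)

probs = softmax(theta)
print("final policy:", probs)                 # mostly arm 1, but not exactly [0, 1]
print("expected reward:", probs @ arm_means)  # below the 0.7 maximum
```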
Because AI researchers try to make the reward number go up (it’s a nice, legible statistic), we can’t anticipate in advance ways in which the trained policy won’t get maximal reward; otherwise the designers would have anticipated them as well.
An AI will effectively argmax over all plans according to its goals.
Response: I think this is not how cognition works in reality. There is a deeper set of concepts to consider here, which changes the analysis. See: Alignment allows “nonrobust” decision-influences and doesn’t require robust grading.
I think there are just a bunch of really strong claims like these, about some variable getting set really high or really low for some reason or another, and the claims now seem really weird to me. I initially feel confused why they’re being brought up / repeated with such strength and surety.
I speculate that there’s a combination of:
It feels less impressive to imagine a training run where the AGI only gets a high average policy-gradient-intensity score (i.e. reward), and not a maximally high number.
Wouldn’t a sophisticated mind not only be really good at realizing its values (like making paperclips), but also get minimal predictive loss?
(No.)
People don’t know how to set realistic parameter values for these questions because they’re confused about AGI / how intelligence works, and they don’t explicitly note that to themselves, so they search under the streetlight of what current theory can actually talk about. They set the <quantity> setting to MAX or MIN in weak part because limit-based reasoning is socially approved of.
But in math and in life, you really have to be careful about your limit-based reasoning!
IE you can make statements about global minima of the loss landscape. But what if training doesn’t get you there? That’s harder to talk about.[2]
Or maybe some hapless RL researcher spends years thinking about optimal policies, only to wake up from the dream and realize that optimality results don’t have to tell us a damn thing about trained policies...
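One way to write down the distinction being gestured at (my notation, a sketch rather than anything from the post): compare

$$\theta^\star \in \arg\min_\theta \mathcal{L}(\theta) \qquad \text{versus} \qquad \theta_T = \theta_0 - \eta \sum_{t=0}^{T-1} \nabla_\theta \hat{\mathcal{L}}_{B_t}(\theta_t),$$

where $\hat{\mathcal{L}}_{B_t}$ is the loss on minibatch $B_t$. Results about $\theta^\star$ (or about optimal policies $\pi^\star$) only transfer to $\theta_T$, the thing training actually hands you, via some further argument about the training dynamics, and that argument is usually the missing piece.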
“Argmax” or “loss minimization” feel relatively understandable, whereas “learned decision-making algorithms” are fuzzy and hard. But what if the former set has little to do with the latter?
My response here assumes that PP is about self-supervised learning on intermediate neuronal activations (e.g. “friend neuron”) and also immediate percepts (e.g. “firing of retinal cells”). Maybe I don’t understand PP if that’s not true.
There can be very legit reasons to search under the streetlight, though. Sometimes that helps you see a bit farther out, by relaxing assumptions.
I think this type of criticism is applicable in an even wider range of fields than even you immediately imagine (though in varying degrees, and with greater or lesser obviousness or direct correspondence to the SGD case). Some examples:
Despite the economists, the economy doesn’t try to maximize welfare, or even net dollar-equivalent wealth. It rewards firms which are able to make a profit in proportion to how much profit they’re able to make, and dis-rewards firms which aren’t able to make a profit. Firms which would technically be profitable, but have no local profit incentive gradient pointing towards them (factoring in the existence of rich people and lenders, neither of which are perfect expected-profit maximizers), generally will not come into existence.
Individual firms also don’t (only) try to maximize profit. Some parts of them may maximize profit, but most are just structures of people built from local social capital and economic capital incentive gradients.
Politicians don’t try to (only) maximize win-probability.
Democracies don’t try to (only) maximize voter approval.
Evolution doesn’t try to maximize inclusive genetic fitness.
Memes don’t try to maximize inclusive memetic fitness.
Academics don’t try to (only) maximize status.
China doesn’t maximize allegiance to the CCP.
I think there’s a general tendency for people to look at local updates in a system (when the system has humans as decision nodes, the local updates are called incentive gradients), somehow perform some integration-analogue to recover a function which would produce those local updates, then find a local minimum of that “integrated” function and claim that the system is at that minimum, or is well approximated by treating it as if it were. In empirical systems this is generally constrained by common sense learned through experience with the system, but the less empirical the system (like the economy, or SGD), the crazier people get, because they have less learned common sense to guide the analysis.
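To spell the move out in notation of my own (a sketch, not something from the comment): suppose the system follows local updates

$$x_{t+1} = x_t - \eta\, g(x_t).$$

The “integration-analogue” is to posit a function $F$ with $\nabla F = g$, so that the updates look like descent on $F$, and then reason as if the system sits at a minimum of $F$. That needs at least three further claims, each of which can fail: (i) $g$ really is a gradient field (many incentive structures are not, and the dynamics can cycle instead of settling), (ii) the dynamics converge on the relevant timescale, and (iii) the minimum they settle into is the one being used to describe the system (e.g. the global one), rather than some other local minimum or a slow plateau.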
very pithy. nice insight, thanks.