Such arguments claim we will see a similar transition while training AIs, with SGD creating some ‘inner thing’ which is not SGD and which gains capabilities much faster than SGD can insert them into the AI. Then, just like human civilization exploded in capabilities over a tiny evolutionary time frame, so too will AIs explode in capabilities over a tiny “SGD time frame”.
I don’t think this is an accurate summary of the argument for the plausibility of a sharp left turn. The post you link doesn’t actually mention gradient descent at all. This inaccuracy is load-bearing, because the most capable AI systems in the near future seem likely to gain many kinds of capabilities through methods that don’t resemble SGD, long before they are expected to take a sharp left turn.
For example, Auto-GPT is composed of foundation models which were trained with SGD and RLHF, but there are many ways to enhance the capabilities of Auto-GPT that do not involve further training of the foundation models. The repository currently has hundreds of open issues and pull requests. Perhaps a future instantiation of Auto-GPT will be able to start closing these PRs on its own, but in the meantime there are plenty of humans doing that work.
It doesn’t mention the literal string “gradient descent”, but it clearly makes reference to the current methodology of training AI systems (which is gradient descent). E.g., here:
The techniques OpenMind used to train it away from the error where it convinces itself that bad situations are unlikely? Those generalize fine. The techniques you used to train it to allow the operators to shut it down? Those fall apart, and the AGI starts wanting to avoid shutdown, including wanting to deceive you if it’s useful to do so.
The implication is that the dangerous behaviors that manifest during the SLT are supposed to have been instilled (at least partially) during the training (gradient descent) process.
However, the above is a nitpick. The real issue I have with your comment is that you seem to be criticizing me for not addressing the “capabilities come from not-SGD” threat scenario, when addressing that threat scenario is what this entire post is about.
Here’s how I described SLT (which you literally quoted): “SGD creating some ‘inner thing’ which is not SGD and which gains capabilities much faster than SGD can insert them into the AI.”
This is clearly pointing to a risk scenario where something other than SGD produces the SLT explosion of capabilities.
You say:
For example, Auto-GPT is composed of foundation models which were trained with SGD and RLHF, but there are many ways to enhance the capabilities of Auto-GPT that do not involve further training of the foundation models. The repository currently has hundreds of open issues and pull requests. Perhaps a future instantiation of Auto-GPT will be able to start closing these PRs on its own, but in the meantime there are plenty of humans doing that work.
To which I say, yes, that’s an example where SGD creates some ‘inner thing’ (the ability to contribute to the Auto-GPT repository), which (one might imagine) would let Auto-GPT “gain capabilities much faster than SGD can insert them into the AI.” This is exactly the sort of thing I’m talking about in this post, and exactly the sort of thing I’m saying won’t lead to an SLT.
(Or at least, I’m saying that the evolutionary example provides no reason to think Auto-GPT might undergo an SLT, because the evolutionary SLT relied on the sudden unleashing of ~9 OOMs of extra available optimization power, relative to the previous mechanisms of capabilities accumulation over time.)
Finally, I’d note that a very significant portion of this post is explicitly focused on discussing “non-SGD” mechanisms of capabilities improvement. Everything from “Fast takeoff is still possible” and on is specifically about such scenarios.
However, the above is a nitpick. The real issue I have with your comment is that you seem to be criticizing me for not addressing the “capabilities come from not-SGD” threat scenario, when addressing that threat scenario is what this entire post is about.
That’s not what I’m criticizing you for. I elaborated a bit more here; my criticism is that this post sets up a straw version of the SLT argument to knock down, one example of which is assuming it applies narrowly to “spiky” capability gains via SGD.
The actual SLT argument is about a capabilities regime (human-level+), not a specific method for reaching it or how many OOM of optimization power are applied before or after.
The reason to expect a phase shift in such a regime is that (by definition) a human-level AI is capable of reflection, deception, having insights that no other human has had before (as current humans sometimes do), etc.
Note, I’m not saying that setting up a strawman automatically invalidates all the rest of your claims, nor that you’re obligated to address every possible kind of criticism. But I am claiming that you aren’t passing the ITT of someone who accepts the original SLT argument (and probably its author).
But it does mean you can’t point to support for your claim that evolution provides no evidence for the Pope!SLT, as support for the claim that evolution provides no evidence for the Soares!SLT, and expect that to be convincing to anyone who doesn’t already accept that Pope!SLT == Soares!SLT.
Then my disagreement is with the claim that the human regime is very special, or that there’s any reason to attach much specialness to human-level intelligence.
In essence, I agree with a weaker version of Quintin Pope’s comment here: https://forum.effectivealtruism.org/posts/zd5inbT4kYKivincm/?commentId=Zyz9j9vW8Ai5eZiFb
“AGI” is not the point at which the nascent “core of general intelligence” within the model “wakes up”, becomes an “I”, and starts planning to advance its own agenda. AGI is just shorthand for when we apply a sufficiently flexible and regularized function approximator to a dataset that covers a sufficiently wide range of useful behavioral patterns.
There are no “values”, “wants”, “hostility”, etc. outside of those encoded in the structure of the training data (and to a FAR lesser extent, the model/optimizer inductive biases). You can’t deduce an AGI’s behaviors from first principles without reference to that training data. If you don’t want an AGI capable and inclined to escape, don’t train it on data[1] that gives it the capabilities and inclination to escape.
Putting it another way, I suspect you’re suffering from the fallacy of generalizing from fiction, since fictional portrayals make the transition far more discontinuous and misaligned, à la the Terminator, than what happens in reality.
Link below:
https://www.lesswrong.com/posts/rHBdcHGLJ7KvLJQPk/the-logical-fallacy-of-generalization-from-fictional