You’re making a claim about both:
1. what sorts of cognitive capabilities can exist in reality, and
2. whether current (or future) training regimes are likely to find them
It sounds like you agree that the relevant cognitive capabilities are likely to exist, though maybe not for prime number factorization, and that it’s unclear whether they’d fit inside current architectures.
I do not read Eliezer as making a claim that future GPT-n generations will become perfect (or approximately perfect) text predictors. He is specifically rebutting claims others have made that GPTs/etc cannot become ASI because, e.g., they are “merely imitating” human text. That claim is not obviously true: to the extent that there exist cognitive capabilities which can solve these prediction problems, which are physically possible to instantiate in GPT-n model weights, and which are within the region of possible outcomes of our training regimes (+ the data used for them), it is possible that we will find them.
He is specifically rebutting claims others have made that GPTs/etc cannot become ASI because, e.g., they are “merely imitating” human text.
That may be, but I’m not seeing that context here. It ends up reading to me as “look how powerful a perfect predictor would be, (and? so?) if we keep training them we’re going to end up with a perfect predictor (and, I extrapolate, then we’re hosed)”.
I’m not trying to make any confident claim that GPT-whatever can’t become dangerous[1]. But I don’t think that talking about how powerful GPTs would be if they reached implausible performance levels really says anything at all about whether they’d be dangerous at plausible ones.
For that matter, even if you reached an implausible level, it’s still not obvious that you have a problem, given that the implausible capability would still be used for pure text prediction. Generating text that, say, manipulated humans and took over the world would be a prediction error, since no such text would ever arise from any of the sources being predicted. OK, unless it predicts that it’ll find its own output in the training data....
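To put the “prediction error” framing in slightly more concrete terms (a minimal sketch; this is just the generic next-token objective, not anything specific to GPT-n), the model is trained to minimize

$$\mathcal{L}(\theta) = -\,\mathbb{E}_{x \sim D}\Big[\sum_t \log p_\theta(x_t \mid x_{<t})\Big]$$

where $D$ is the distribution of sources being predicted. Any probability mass spent on continuations that never occur under $D$ necessarily comes at the expense of mass on continuations that do, and so shows up as higher loss; that’s the sense in which “manipulative text nobody would ever have written” counts, by the objective’s own lights, as an error.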
Although, even with plugins, there are a lot of kinds of non-prediction-like capabilities I’d need to see before I thought a system was obviously dangerous [2].
But I don’t think that talking about how powerful GPTs would be if they reached implausible performance levels really says anything at all about whether they’d be dangerous at plausible ones.
This seems like it’s assuming the conclusion (that reaching dangerous capabilities using these architectures is implausible). Eliezer did consider it unlikely, though GPT-4 was a negative update in that regard.
For that matter, even if you reached an implausible level, it’s still not obvious that you have a problem, given that the implausible capability would still be used for pure text prediction. Generating text that, say, manipulated humans and took over the world would be a prediction error, since no such text would ever arise from any of the sources being predicted. OK, unless it predicts that it’ll find its own output in the training data....
This seems like it’s assuming that the system ends up outer-aligned.
This seems like it’s assuming the conclusion (that reaching dangerous capabilities using these architectures is implausible).
I think that bringing up the extreme difficulty of approximately perfect prediction, with a series of very difficult examples, and treating that as interesting enough to post about, amounts to taking it for granted that it is plausible that these architectures can get very, very good at prediction.
I don’t find that plausible, and I’m sure that there are many, many other people who won’t find it plausible either, once you call their attention to the assumption. The burden of proof falls on the proponent; if Eliezer wants us to worry about it, it’s his job to make it plausible to us.
This seems like it’s assuming that the system ends up outer-aligned.
It might be. I have avoided remembering “alignment” jargon, because every time I’ve looked at it I’ve gotten the strong feeling that the whole ontology is completely wrong, and I don’t want to break my mind by internalizing it.
It assumes that it ends up doing what you were trying to train it to do. That’s not guaranteed, for sure… but on the other hand, it’s not guaranteed that it won’t. I mean, the whole line of argument assumes that it gets incredibly good at what you were trying to train it to do. And all I said was “it’s not obvious that you have a problem”. I was very careful not to say that “you don’t have a problem”.
I agree that the post makes somewhat less sense without the surrounding context (in that it was originally generated as a series of tweets, which I think were mostly responding to people making a variety of mistaken claims about the fundamental limitations of GPT/etc).
Referring back to your top-level comment:
I honestly don’t see the relevance of this.
The relevance should be clear: in the limit of capabilities, such systems could be dangerous. Whether the relevant threshold is reachable via current methods is unknown—I don’t think Eliezer thinks it’s overwhelmingly likely; I myself am uncertain. You do not need a system capable of reversing hashes for that system to be dangerous in the relevant sense. (If you disagree with the entire thesis of AI x-risk then perhaps you disagree with that, but if so, then perhaps mention that up-front, so as to save time arguing about things that aren’t actually cruxy for you?)
But there’s literally no reason to think that the architectures being used can ever get that good at prediction, especially not if they have to meet any realistic size constraint and/or are restricted to any realistically available amount of training input.
Except for the steadily-increasing capabilities they continue to display as they scale? Also my general objection to the phrase “no reason”/“no evidence”: there obviously is evidence; if you think that evidence should be screened off, please argue that explicitly.
The relevance should be clear: in the limit of capabilities, such systems could be dangerous.
What I’m saying is that reaching that limit, or reaching any level qualitatively similar to that limit, via that path, is so implausible, at least to me, that I can’t see a lot of point in even devoting more than half a sentence to the possibility, let alone using it as a central hypothesis in your planning. Thus “irrelevant”.
It’s at least somewhat plausible that you could reach a level that was dangerous, but that’s very different from getting anywhere near that limit. For that matter, it’s at least plausible that you could get dangerous just by “imitation” rather than by “prediction”. So, again, why put so much attention into it?
Except for the steadily-increasing capabilities they continue to display as they scale? Also my general objection to the phrase “no reason”/“no evidence”: there obviously is evidence; if you think that evidence should be screened off, please argue that explicitly.
OK, there’s not no evidence. There’s just evidence weak enough that I don’t think it’s worth remarking on.
I accept that they’ve scaled a lot better than anybody would have expected even 5 years ago. And I expect them to keep improving for a while.
But...
They’re not so opaque as all that, and they’re still just using basically pure statistics to do their prediction, and they’re still basically doing just prediction, and they’re still operating with finite resources.
When you observe something that looks like an exponential in real life, the right way to bet is almost always that it’s really a sigmoid (see the sketch below).
Whenever you get a significant innovation, you would expect to see a sudden ramp-up in capability, so actually seeing such a ramp-up, even if it’s bigger than you would have expected, shouldn’t cause you to update that much about the final outcome.
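To illustrate the exponential-vs-sigmoid point above, here’s a minimal numerical sketch (made-up parameters, not a model of any real capability metric): in its early phase a logistic curve is nearly indistinguishable from an exponential, so an exponential extrapolated from early data wildly overshoots once the curve starts to saturate.

```python
import numpy as np

# Logistic (sigmoid) curve with ceiling K, growth rate r, midpoint t0.
# Parameters are made up purely for illustration.
K, r, t0 = 100.0, 1.0, 10.0

def logistic(t):
    return K / (1.0 + np.exp(-r * (t - t0)))

# The exponential that matches the logistic's early behaviour:
# for t well below t0, logistic(t) ~ K * exp(r * (t - t0)).
def early_exponential(t):
    return K * np.exp(r * (t - t0))

for t in [2, 4, 6, 12, 16]:
    print(t, round(logistic(t), 3), round(early_exponential(t), 3))

# At t = 2..6 the two curves agree to within about 2%;
# by t = 16 the exponential forecast is roughly 400x the
# logistic's actual (saturating) value.
```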
If I wanted to find the thing that worries me most, it’d probably be that there’s no rule that somebody building a real system has to keep the architecture pure. Even if you do start to get diminishing returns from “GPTs” and prediction, you don’t have to stop there. If you keep adding elements to the architecture, ranging from the obvious to the only-somewhat-unintuitive, you can get in at the bottoms of more sigmoids. And the effects can easily be synergistic. And what we definitely have is a lot of momentum: many smart people’s attention and a lot of money [1] at stake, plus whatever power you get from the tools already built. That kind of thing is how you get those innovations.
I love footnotes.
Added on edit: and, maybe worse, prestige…