I think that’s more of a sociological question than it is a technical question. The class of problems and the class of algorithms that are considered AI has changed dramatically over the past 50 years
This is actually a pretty good point: when we talk about “AI” instead of “software” or “machines” or whatever, a lot of people think about this narrow thing, which is something like “hypothetical programs on or beyond the edge of scientific / technological development.” And so anyone who can tell you where the edge will be in 50 years is obviously overconfident about their predictive ability!
0:34:48.2 Vael: … Well, it seems like you could train it on one personality if you wanted to, right? If you had enough data for that, which we don’t. But if we did. And then I wouldn’t really worry about it having different agents in it.
0:35:17.6 Interviewee: That’s a very, very, very, very, very, very, very, very large amount of text.
0:35:26.5 Interviewee: Do you any—do you have any scope of understanding for how much text that is?
It seems worth pointing out that my first question was “wait, is it that large?” and then they do the math and it does seem like that many ‘very’s is justified.
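For a sense of scale, here is a rough back-of-envelope version of that math. The daily word count and the words-per-token ratio are illustrative assumptions, not measurements; the ~300B-token figure is roughly the amount of text GPT-3 processed during training.

```python
# Back-of-envelope: one human's lifetime text output vs. GPT-3's training data.
# The per-day output and words-per-token figures are rough assumptions.
GPT3_TRAINING_TOKENS = 300e9   # ~300B tokens processed in training (GPT-3 paper)
WORDS_PER_TOKEN = 0.75         # common rule of thumb for English BPE tokens

# A very prolific writer: 1,000 words/day, every day, for 60 years.
lifetime_words = 1_000 * 365 * 60
lifetime_tokens = lifetime_words / WORDS_PER_TOKEN

ratio = GPT3_TRAINING_TOKENS / lifetime_tokens
print(f"lifetime output: ~{lifetime_words / 1e6:.0f}M words "
      f"(~{lifetime_tokens / 1e6:.0f}M tokens)")
print(f"GPT-3's corpus is ~{ratio:,.0f}x larger")
```

Even under these generous assumptions, a single person's entire written output comes to a few tens of millions of tokens, roughly four orders of magnitude short of the corpus — which is the gap all those ‘very’s are pointing at.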
But some followups:
As Vael points out, you can’t pre-train on the whole internet and then post-train on a single person. [The other personalities are still in there, and won’t be fully extinguished.] But there are presumably other things you can do to ‘extract’ one of the personalities out of there.
To the extent that there is a ‘style’ that you want to capture, rather than a particular person’s “personality”, quite plausibly there are a billion pages written in that style. (There are probably a billion pages of legalese?)
While it’s compute-efficient to get more data (instead of running more passes over that data), if you want to be in a data-limited regime (like just using the text corpus output by Isaac Asimov), presumably you can do many passes over that data.
Humans are somehow doing something much more data-efficient than GPT-3 is. (I definitely read less than five billion pages as part of learning how to write!) While the Interviewee seems pretty confident that’s related to being able to interact with the world, I guess I’m not convinced, or am imagining that imaginary interaction will be able to generate significant data.
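The many-passes point above can be made concrete with a toy sketch. Here a character-level unigram model fit by gradient descent stands in for a real language model, and the corpus, learning rate, and epoch count are arbitrary illustrative choices; the only point is that with a small fixed corpus, nothing stops you from running many epochs, and the training loss keeps dropping as passes accumulate.

```python
# Toy sketch of the data-limited regime: many passes over one small corpus.
# A character-level unigram model (softmax over characters, fit by full-batch
# gradient descent on cross-entropy) stands in for a real language model.
import math
from collections import Counter

corpus = "the caves of steel the naked sun the robots of dawn"  # tiny fixed corpus
vocab = sorted(set(corpus))
logits = {c: 0.0 for c in vocab}      # start from a uniform distribution
freqs = Counter(corpus)
n, lr = len(corpus), 0.1

def mean_nll():
    """Mean negative log-likelihood (nats/char) of the corpus under the model."""
    z = sum(math.exp(v) for v in logits.values())
    return -sum(math.log(math.exp(logits[c]) / z) for c in corpus) / n

losses = []
for epoch in range(50):               # many passes over the same data
    z = sum(math.exp(v) for v in logits.values())
    probs = {c: math.exp(v) / z for c, v in logits.items()}
    for c in vocab:                   # d(mean NLL)/d(logit_c) = p_c - empirical freq of c
        logits[c] -= lr * (probs[c] - freqs[c] / n)
    losses.append(mean_nll())

print(f"loss after 1 epoch: {losses[0]:.3f}, after 50 epochs: {losses[-1]:.3f}")
```

Each extra pass keeps lowering the loss toward the entropy of the (tiny) corpus — exactly the memorization-flavored behavior you would expect when data, not compute, is the binding constraint.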
So a pressingly important question is, to what extent does this interfere with… Let’s, to make language easier, call it one of its personalities. Let’s say one of its personalities wants to do something in the world: kill all the humans or even something mundane. To what extent does the fact that it’s not the only personality interfere with its ability to create and execute plans?
This whole section (I’m quoting a typical passage near the start) seems pretty good. Basically, the Interviewee is reacting to people worried that GPT-3 is going to ‘wake up’ a la Skynet with: “have you seen how crazily incoherent this thing is?”
Somehow this section reminded me of some recent Eliezer stuff, which I’ll summarize as saying that he’s pretty confident in his models of what AI will look like eventually, and not what it’ll look like soon; as you rise thru the atmosphere, there’s a bunch of turbulence, and only after you get thru that is the flight path smooth and predictable.
In this situation, GPT-3 seems like it has lots of personalities and switches between them in an undeliberate way; but presumably the CEO-bot is trying to be coherent, instead of trying to be able to plausibly generate any of the text that it saw on the internet. [The interviewee notes that this incoherency causes capability problems; presumably systems will become more coherent as part of the standard push towards capability!]
[I should note that I think this focuses too much on the human-visible ‘incoherence’ to be really reassuring that there’s not some sort of coherency being trained where we haven’t figured out where to look for it yet.]
To clarify a little, Vael initially suggests that you could train GPT-3 from scratch on one human’s output to get a safe imitation of one specific agent (that human), without any further weirdness. This does seem obviously wrong: there is probably more than enough information in that output to recover the human’s personality, etc., but one human’s lifetime output of text clearly does not encode everything they have learned about the world and is radically inadequate. Sample-efficiency of DL is beside the point; the data just is not there—I have learned far more about, say, George Washington than I have ever written down (because why would I do that?) and no model trained from scratch on my writings will know as much about George Washington as I do.
However, this argument is narrow and only works for text and text outputs. Text outputs may be inadequate, but those two words immediately suggest a response: What about my text inputs? Or inputs beyond text? Multimodal models, which are already enjoying much success now, are the most obvious loophole here. Obviously, if you had all of one specific human’s inputs, you do have enough data to train an agent ‘from scratch’, because that is how that human grew up to be an agent! It is certainly not absurd to imagine recording a lifetime of video footage and body motion, and there are already the occasional fun lifelogging projects which try to do similar things, such as to study child development. Since everything I learned about George Washington I learned from using my eyes to look at things or my ears to listen to people, video footage of that would indeed enable a model to know as much about George Washington as I do.
Unfortunately, that move immediately brings back all of the safety questions: you are now training the model on all of the inputs that human has been exposed to throughout their lifetime, including all the Internet text written by other people. All of these inputs are going to be modeled by the human, and by the model you are training on those inputs. So the ‘multiple personality’ issue comes right back. In humans, we typically have a strong enough sense of identity that words spoken by other humans can’t erase or hijack our identity… typically. (Your local Jesus or Napoleon might beg to differ.) With models, there’s no particular constraint by default from a giant pile of parameters learning to predict. If you want that robustness, you’re going to have to figure out how to engineer it.
I disagree with their characterization of DRL, which is highly pessimistic and in parts factually wrong (eg. I’ve seen plenty of robot policy transfer).
I agree with them about thinking of GPT-3 as an ensemble of agents randomly sampled from the Internet, but I think they are mostly wrong about how hard coherency/consistency is or how necessary it is; it doesn’t strike me as all that hard or important, much less as the most critical and important limitation of all.
Of course starting with an empty prompt will yield incoherent gibberish, since you’re unlikely to sample the same agent twice, but the prompt can easily keep identity on track, and if you pick the same agent, it can be coherent (or at least, it seems to me that the interviewee is relying heavily on the ‘crowdsourced distribution’ being incoherent with itself—which of course it is, just as humanity as a whole is deeply incoherent—but punting on the real question, which is whether any agent GPT-3 can execute can be coherent; the answer is either already yes, or it seems like further scaling/improvements would render it more coherent). GPT-3 is inferring a latent variable in the POMDP of modeling people; it doesn’t take much evidence to update the variable to high confidence. (Proof: type “Hitler says”—or “Gwern says*”—into your friendly local GPT-3-scale model. That’s all of… <40 bits? or like 4 tokens.) The more history it has, the more it is conditioning on in inferring which MDP it is in and who it is. This prompt or history could be hardwired too, note; it doesn’t even have to be text: look at all the research on developing continuous prompts or very lightweight finetuning.
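That bit-count guess is easy to sanity-check against the tokenizer’s vocabulary size. A minimal sketch of the arithmetic (the 4-token prompt length is an illustrative assumption; 50,257 is the GPT-2/GPT-3 BPE vocabulary size):

```python
# Sanity check on "<40 bits? or like 4 tokens": an upper bound on the evidence
# a short prompt can carry. 50,257 is the GPT-2/GPT-3 BPE vocabulary size;
# the 4-token prompt length is an illustrative assumption.
import math

vocab_size = 50_257
prompt_tokens = 4
bits_per_token_max = math.log2(vocab_size)     # ~15.6 bits if tokens were uniform
max_bits = prompt_tokens * bits_per_token_max  # hard ceiling, ~62 bits

print(f"at most {max_bits:.1f} bits in {prompt_tokens} tokens")
# Real text is far from uniform over the vocabulary, so the information actually
# conveyed is well below this ceiling -- consistent with the "<40 bits" guess.
```

So even the hard information-theoretic ceiling on a 4-token prompt is only ~62 bits, and since natural-language tokens are highly predictable rather than uniform, the effective evidence is considerably lower — a handful of tokens really is enough to collapse the latent “who am I?” variable.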
Also, consistency of agent may be overrated given how extremely convergent goals are given circumstances (Veedrac takes this further than my Clippy story). The real world involves a few major decisions which determine much of your life, filled in by obvious decisions implementing the big ones or which are essentially irrelevant, like deciding what movie to watch to unwind tonight; the big ones, being so few, are easy to make coherent, and the little ones are either so obvious that agents would have to be extremely incoherent to diverge, or it doesn’t matter. If you wake up in a CEO’s chair and can do anything, nevertheless, the most valuable thing is probably going to involve playing your role as CEO and dealing with the problem your subordinate just brought you; the decision that mattered, where agents might disagree, was the one that made you CEO in the first place, but now that is a done deal. Or more concretely: I can, for example, predict with great confidence that the majority of humans on earth, if they were somehow teleported into my head right now like some body-swap Saturday morning cartoon, would shortly head to the bathroom, and this is the result of a decision I made several hours ago involving tea; and I can predict with even greater confidence that they will do so by standing up and walking into the hallway and walking through the doorway (as opposed to all the other actions they could have taken, like wriggling over the floor like a snake or trying to thrust their head through the wall). A GPT-3 which is instantiating a ‘cloud’ of agents around a prompt may be externally indistinguishable from a ‘single’ human-like agent (let’s pretend that humans are the same exact agent every day and are never inconsistent...), because they all want fairly similar things and, when instantiated, all wind up making pretty much the same exact choices, with the variations and inconsistencies being on minor things like what to have for breakfast.
(It’s too bad Vael didn’t ask what those suggested experiments were, it might’ve shed a lot of light on what the interviewee thinks. We might not disagree as much as I think we do.)
* if you were wondering, first completion in playground goes
Gwern says: I found [this](https://www.reddit.com/r/rational/comments/b0vu8z/the_sequences_an_evergrowing_rationalist_bible/) on reddit. It’s a collection of sequences, which are basically essays written by Scott Alexander. He’s a psychiatrist who writes a lot about rationality, and these sequences are basically his attempt to explain the basics of rationality to people. I found them really helpful, and I thought other people might find them helpful too.
It may not have located me in an extremely precise way, but how hard do you think agents sampled from this learned distribution of pseudo-gwerns would find it to coordinate with each other or to continue projects? Probably not very. And to the extent they don’t, why couldn’t that be narrowed down by a history of ‘gwern’ declaring what his plans are and what he is working on at that moment, which each instantiated agent will condition on when it wakes up and try to predict what ‘gwern’ would do in such a situation?
Commentary on the 7ujun transcript: