To clarify a little, Vael initially suggests that you could train GPT-3 from scratch on one human’s output to get a safe imitation of one specific agent (that human), without any further weirdness. This does seem obviously wrong: there is probably more than enough information in that output to recover the human’s personality etc., but one human’s lifetime output of text clearly does not encode everything they have learned about the world and is radically inadequate. Sample-efficiency of DL is beside the point; the data just is not there. I have learned far more about, say, George Washington than I have ever written down (because why would I do that?), and no model trained from scratch on my writings will know as much about George Washington as I do.
However, this argument is narrow and only applies to text outputs. Text outputs may be inadequate, but those two words immediately suggest a response: what about my text inputs? Or inputs beyond text? Multimodal models, which are already enjoying much success now, are the most obvious loophole here. Obviously, if you had all of one specific human’s inputs, you do have enough data to train an agent ‘from scratch’, because that is how that human grew up to be an agent! It is certainly not absurd to imagine recording a lifetime of video footage and body motion, and there are already occasional fun lifelogging projects which try to do similar things, such as studying child development. Since everything I learned about George Washington I learned from using my eyes to look at things or my ears to listen to people, video footage of that would indeed enable a model to know as much about George Washington as I do.
Unfortunately, that move immediately brings back all of the safety questions: you are now training the model on all of the inputs that human has been exposed to throughout their lifetime, including all the Internet text written by other people. All of those inputs are modeled by the human, and by the model you are training on those inputs, so the ‘multiple personality’ issue comes right back. In humans, we typically have a strong enough sense of identity that words spoken by other humans can’t erase or hijack our identity… typically. (Your local Jesus or Napoleon might beg to differ.) With models, a giant pile of parameters learning to predict imposes no such constraint by default; if you want that robustness, you’re going to have to figure out how to engineer it.
I disagree with their characterization of DRL, which is highly pessimistic and in parts factually wrong (e.g. I’ve seen plenty of robot policy transfer).
I agree with them about thinking of GPT-3 as an ensemble of agents randomly sampled from the Internet, but I think they are mostly wrong about how hard coherency/consistency is or how necessary it is; it doesn’t strike me as all that hard or important, much less as the most critical and important limitation of all.
Of course starting with an empty prompt will produce incoherent gibberish, since you’re unlikely to sample the same agent twice, but the prompt can easily keep identity on track, and if you pick the same agent, it can be coherent. (Or at least, it seems to me that the interviewee is relying heavily on the ‘crowdsourced distribution’ being incoherent with itself, which of course it is, just as humanity as a whole is deeply incoherent, while punting on the real question: whether any agent GPT-3 can execute can be coherent, to which the answer is either yes already, or that further scaling/improvements would render it more coherent.) GPT-3 is inferring a latent variable in the POMDP of modeling people, and it doesn’t take much evidence to update that variable to high confidence. (Proof: type “Hitler says”, or “Gwern says”*, into your friendly local GPT-3-scale model. That’s all of… <40 bits? Or like 4 tokens.) The more history it has, the more it is conditioning on in inferring which MDP it is in and who it is. Note that this prompt or history could be hardwired too, and it doesn’t even have to be text: look at all the research on developing continuous prompts or very lightweight finetuning.
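For concreteness, here’s a minimal sketch of that persona-conditioning point, using the HuggingFace `transformers` library with GPT-2 as a freely-available stand-in for a GPT-3-scale model (the model choice and sampling settings are my own assumptions, not anything from the interview):

```python
# Minimal sketch: condition a causal LM on a short persona prefix and sample a continuation.
# GPT-2 here is only a stand-in for a GPT-3-scale model.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Gwern says:"
ids = tok(prompt, return_tensors="pt")
print(ids["input_ids"].shape[1], "prompt tokens")  # a handful of tokens is all it takes to pick a persona

out = model.generate(
    **ids,
    max_new_tokens=60,
    do_sample=True,
    top_p=0.95,
    pad_token_id=tok.eos_token_id,  # GPT-2 has no pad token; reuse EOS to silence the warning
)
print(tok.decode(out[0], skip_special_tokens=True))
```

Every extra token of history narrows down further which ‘agent’ the model thinks it is continuing.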
Also, consistency of agent may be overrated given how extremely convergent goals are once circumstances are fixed (Veedrac takes this further than my Clippy story). The real world involves a few major decisions which determine much of your life, filled in by obvious decisions implementing the big ones or decisions which are essentially irrelevant, like deciding what movie to watch to unwind tonight; the big ones, being so few, are easy to make coherent, and the little ones are either so obvious that agents would have to be extremely incoherent to diverge, or they don’t matter. If you wake up in a CEO’s chair and can do anything, nevertheless, the most valuable thing is probably going to involve playing your role as CEO and dealing with the problem your subordinate just brought you; the decision that mattered, where agents might disagree, was the one that made you CEO in the first place, but now that is a done deal. Or more concretely: I can, for example, predict with great confidence that the majority of humans on earth, if they were somehow teleported into my head right now like some body-swap Saturday morning cartoon, would shortly head to the bathroom, and this is the result of a decision I made several hours ago involving tea; and I can predict with even greater confidence that they will do so by standing up and walking into the hallway and through the doorway (as opposed to all the other actions they could have taken, like wriggling over the floor like a snake or trying to thrust their head through the wall). A GPT-3 which is instantiating a ‘cloud’ of agents around a prompt may be externally indistinguishable from a ‘single’ human-like agent (let’s pretend that humans are the same exact agent every day and are never inconsistent...), because they all want fairly similar things and, when instantiated, all wind up making pretty much the same choices, with the variations and inconsistencies being on minor things like what to have for breakfast.
(It’s too bad Vael didn’t ask what those suggested experiments were; it might’ve shed a lot of light on what the interviewee thinks. We might not disagree as much as I think we do.)
* If you were wondering, the first completion in the Playground goes:
> Gwern says: I found [this](https://www.reddit.com/r/rational/comments/b0vu8z/the_sequences_an_evergrowing_rationalist_bible/) on reddit. It’s a collection of sequences, which are basically essays written by Scott Alexander. He’s a psychiatrist who writes a lot about rationality, and these sequences are basically his attempt to explain the basics of rationality to people. I found them really helpful, and I thought other people might find them helpful too.
It may not have located me in an extremely precise way, but how hard do you think agents sampled from this learned distribution of pseudo-gwerns would find it to coordinate with each other or to continue projects? Probably not very. And to the extent they don’t, why couldn’t that be narrowed down further by a history of ‘gwern’ declaring what his plans are and what he is working on at that moment, which each instantiated agent conditions on when it wakes up and tries to predict what ‘gwern’ would do in such a situation?
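As a toy illustration of that ‘condition each instance on a declared plan log’ idea (the log contents, helper function, and prompt format here are entirely my own invention):

```python
# Toy sketch: every freshly-instantiated 'pseudo-gwern' is prompted with the same
# running plan log, so each sample conditions on the same declared goals and state.
# The log entries and prompt format below are invented purely for illustration.
plan_log = [
    "Monday: finish the draft essay; start no new projects.",
    "Wednesday: draft done; next step is collecting citations and references.",
]

def build_prompt(log: list[str], situation: str) -> str:
    """Prepend the shared plan log so any sampled agent wakes up knowing the plan."""
    header = "Gwern's running plan log:\n" + "\n".join(f"- {entry}" for entry in log)
    return f"{header}\n\nSituation: {situation}\nGwern says:"

print(build_prompt(plan_log, "An editor asks for an unrelated article by Friday."))
```

Feed the result to the same sampling code as above; each instance is then predicting what ‘gwern, with these plans’ would do, which is about all the coordination that seems needed.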