Can you give more details on how it works? I’m imagining that it has some algorithm for detecting whether a command has been fulfilled, and that it is rewarded partly for accurate predictions and partly for fulfilled commands. If so, how is that command-fulfillment detector built or trained?
It works the same way GPT makes TL;DR summaries. There is no reward for a correct TL;DR and no extra training; it just completes the sequence in the most probable way. Some self-driving cars work the same way: there is an end-to-end neural net, without any internal world model, and it just predicts what a normal car would do in this situation. I heard from an ML friend that they could achieve reasonably good driving with such models.
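To make that concrete, here is a minimal sketch of the loop I have in mind (the Gpt7 interface, the tokenization, and the sensor/motor functions are all made-up names for illustration, not a real API). The only thing the model ever does is pick the most probable next token; nothing anywhere checks whether the command was fulfilled or hands out a reward.

```python
# Hypothetical sketch: behavior as pure sequence completion.
# "Gpt7" is an assumed interface, not a real library.

from typing import List, Protocol


class Gpt7(Protocol):
    def most_probable_next_token(self, context: List[str]) -> str:
        """Return the argmax next token given the context so far."""
        ...


def run_robot(model: Gpt7, command: str, read_sensors, send_to_motors) -> None:
    """Drive the body by prediction alone.

    read_sensors() -> List[str]  : latest observations, already tokenized
    send_to_motors(token: str)   : executes one low-level action token
    """
    context: List[str] = command.split()          # e.g. "bring me the coffee"
    while True:
        context += read_sensors()                 # observations enter the context
        action = model.most_probable_next_token(context)
        send_to_motors(action)                    # do whatever the model predicts
        context.append(action)                    # the action conditions the future
```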
Oh, OK. So perhaps we give it a humanoid robot body, so that it is as similar as possible to the humans in its dataset, and then we set up the motors so that the body does whatever GPT-7 predicts it will do, and GPT-7 is trained on datasets of human videos (say) so if you ask it to bring the coffee it probably will? Thanks, this is much clearer now.
What’s still a bit unclear to me is whether it has any ability to continue to learn (I guess from the stipulated proposal the answer is “no”, but I’m just like “guys, why the hell did you build GPT-7-Bot instead of something that allowed better iterated amplification or something?”)
Is the spirit of the question “there is no ability to rewrite its architecture, or to re-train it on new data, or anything?”
Even GPT-2 could be calibrated by recent events, called “examples”, so it has some form of memory. The GPT-7 robot has access to all the data it has observed before, so if it said “I want to kill Bill”, it will act in the future as if it had such a desire. In other words, it behaves as if it has memory.
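Concretely, that “memory” is just the context, something like the sketch below (same hypothetical setup as above): nothing is retrained, but past observations and the robot’s own past outputs stay in (or are retrieved back into) the context and keep conditioning later predictions, the way a few prompt examples condition GPT-2.

```python
# Hypothetical sketch: "memory" as conditioning on past context, with no retraining.

from typing import List


def build_context(full_history: List[str], new_observations: List[str],
                  window: int = 4096) -> List[str]:
    """Keep everything the robot has seen or said in full_history;
    the most recent slice of it is what conditions the next prediction."""
    full_history.extend(new_observations)
    return full_history[-window:]   # an earlier "I want to kill Bill" keeps
                                    # influencing behavior for as long as it
                                    # remains (or is retrieved) in the window
```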
It doesn’t have a built-in ability to rewrite its architecture, but it can write code on a laptop or order things on the internet. However, it doesn’t know much about its own internal structure, except that it is a very large GPT model.
Nod. And does it seem to have the ability to gain new cognitive skills? Like, if it reads a bunch of LessWrong posts or attends CFAR, does its ‘memory’ start to include things that prompt it to, say, “stop and notice it’s confused”, “form more hypotheses when facing weird phenomena”, and “cultivate curiosity about its own internal structure”?
(I assume so, just doublechecking)
In that case, it seems like the most obvious ways to keep it friendly are the same way you make a human friendly (expose it to ideas you think will guide it on a useful moral trajectory).
I’m not actually sure what sort of other actions you’re allowing in the hypothetical.