What’s still a bit unclear to me is whether it has any ability to continue to learn (I guess from the stipulated proposal the answer is “no”, but I’m just like “guys, why the hell did you build GPT-7-Bot instead of something that allowed better iterated amplification or something?”)
Is the spirit of the question “there is no ability to rewrite its architecture, or to re-train it on new data, or anything?”
Even GPT-2 could be calibrated by some recent events, called “examples”, so it has some form of memory. The GPT-7 robot has access to all the data it has observed before, so if it once said “I want to kill Bill”, it will act in the future as if it has such a desire. In other words, it behaves as if it has memory.
It doesn’t have a built-in ability to rewrite its architecture, but it can write code on a laptop or order things on the internet. However, it doesn’t know much about its own internal structure, except that it is a very large GPT model.
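(To make “behaves as if it has memory” concrete, here is a minimal sketch, assuming the memory is nothing more than conditioning on an ever-growing record of everything the robot has said and seen. The `model_generate` function and the loop structure are hypothetical illustrations, not anything specified in the scenario.)

```python
# Hypothetical sketch: "memory" as an ever-growing context the model is conditioned on.
# model_generate() is a placeholder for a call to the (fictional) GPT-7 model, not a real API.

def model_generate(context: str) -> str:
    """Placeholder: return the robot's next utterance/action given the full history."""
    raise NotImplementedError("GPT-7 is hypothetical")

transcript = ""  # everything the robot has observed or said so far

def robot_step(observation: str) -> str:
    global transcript
    transcript += f"\nOBSERVATION: {observation}"
    action = model_generate(transcript)   # conditioned on the entire history
    transcript += f"\nACTION: {action}"   # past statements (e.g. "I want to kill Bill")
    return action                         # stay in the context and keep shaping behavior
```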
Nod. And does it seem to have the ability to gain new cognitive skills? Like, if it reads a bunch of LessWrong posts or attends CFAR, does its ‘memory’ start to include things that prompt it to, say, “stop and notice it’s confused” and “form more hypotheses when facing weird phenomena” and “cultivate curiosity about its own internal structure”?
(I assume so, just double-checking.)
In that case, it seems like the most obvious ways to keep it friendly are the same as the ways you’d keep a human friendly (expose it to ideas you think will guide it on a useful moral trajectory).
I’m not actually sure what sort of other actions you’re allowing in the hypothetical.