From GPT to AGI
Epistemic status: Shower thoughts / I have an idea, but not much knowledge
While the smallest GPT-3 model (125M) has 12 attention layers, each with 12 heads of dimension 64, the largest GPT-3 model (175B) uses 96 attention layers, each with 96 heads of dimension 128.
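As a quick sanity check on those numbers (my own arithmetic, not something taken from the paper), the width of each layer is just the number of heads times the head dimension:

```python
# My own arithmetic, not from the paper: layer width = number of heads * head dimension.
configs = {
    "GPT-3 Small (125M)": {"layers": 12, "heads": 12, "head_dim": 64},
    "GPT-3 175B":         {"layers": 96, "heads": 96, "head_dim": 128},
}
for name, c in configs.items():
    d_model = c["heads"] * c["head_dim"]
    print(f"{name}: {c['layers']} layers, width {d_model}")
# GPT-3 Small (125M): 12 layers, width 768
# GPT-3 175B: 96 layers, width 12288
```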
I would expect that with increased model size it will be possible to enlarge the attention field (the context window) a great deal without much need for additional AI insight.
A guide for AI Dungeon describes a pin item that serves as the context for the story. GPT-3 seems to be good at understanding the goals set in the pin item, to the point that it tries to achieve them faster than the person who wrote the guide intended.
If the attention field were larger, it would allow for multiple pin items, and each pin item could also be larger.
In AI Dungeon, GPT-3 can’t influence the content of the pin items, but it would be possible to give GPT-3 the ability to use console commands to write into pin items, giving it memory abilities similar to the short-term and medium-term memory that humans have.
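As a rough sketch of what I mean (the `/pin write` command and the `generate()` function here are hypothetical stand-ins, not anything GPT-3 or AI Dungeon actually exposes):

```python
# Hypothetical sketch: a model that writes into its own pin items via console-style
# commands. generate() and the "/pin write" syntax are stand-ins, not real APIs.
import re

pins = {}  # model-writable "medium-term memory", keyed by slot name

def generate(prompt: str) -> str:
    """Stand-in for a call to the language model."""
    raise NotImplementedError

def step(story_so_far: str) -> str:
    pin_block = "\n".join(f"[pin {slot}] {text}" for slot, text in pins.items())
    output = generate(pin_block + "\n" + story_so_far)
    # If the model emits e.g. "/pin write goals: find the amulet", keep it as memory.
    for slot, text in re.findall(r"^/pin write (\w+): (.+)$", output, flags=re.M):
        pins[slot] = text
    return output
```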
The ability to interact with a Unix console was already shown in the Natural Language Shell example on the OpenAI website.
At the beginning, the resulting agent could be mentored, much as the AI is mentored in AI Dungeon. If the AI were able to query Google via the console, I would imagine that it could be effective at many tasks.
A lot of newspaper articles could be written by such an agent, which scours the internet for other information published on the topic and synthesizes the available information. Cyberwar could also be waged by such agents.
I’d be happy to hear what other people think about such an agent.
It’s not model size/parameters, it’s the cost of the self-attention at runtime. The number of parameters needed to expand self-attention grows linearly, but the runtime memory consumption goes up quadratically. Even a GPT-2-117M can use up to something like 300GB of RAM if you increase the window to 30k. You need more efficient attention or alternative architectures.
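As a back-of-the-envelope illustration (assuming fp16 scores and that the full n×n attention matrix is materialized for every head in every layer, so treat the constants loosely):

```python
# Back-of-the-envelope only: assumes fp16 scores and the full n*n attention matrix
# materialized for every head in every layer.
def attention_score_bytes(layers, heads, ctx, bytes_per_elem=2):
    return layers * heads * ctx * ctx * bytes_per_elem

# GPT-2-117M has 12 layers with 12 heads each.
for ctx in (1024, 30_000):
    gb = attention_score_bytes(12, 12, ctx) / 1e9
    print(f"ctx={ctx}: ~{gb:,.1f} GB of attention scores")
# ctx=1024: ~0.3 GB
# ctx=30000: ~259.2 GB -- the same ballpark as the ~300GB figure above
```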
What exactly is 30k? When I try to calculate the value for GPT-3 it seems to me like 96 * 96 * 128 = 1179648 (1179k) is the resulting value for the 350 GB model.
As explained in the link, that is the size of the context window; past 30k, even TPU pod RAM is too small to run 117M with wider context windows as the RAM usage continues to explode quadratically.
I’m not sure what your calculation is supposed to be.
The unit for your 30k seems to be BPEs (byte-pair encodings).
I found this on https://www.gwern.net/GPT-3#dialogue:
If GPT-2 could have a context window of 30k BPEs with 300GB of RAM, could GPT-3 also have such a context window length? Could it be made 15 times as big as it currently is?
If you tweaked GPT-3 (let’s assume the total parameter count remained the same so layers were made a little narrower or somesuch) to have a 30k BPE context, I think the RAM requirements would explode to the point where even the small layers couldn’t fit their forward pass onto a single GPU. You can forget about training it too.
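For scale, the same rough fp16 estimate applied to GPT-3’s 96 layers and 96 heads lands in the terabytes:

```python
# Same rough fp16 estimate as before, with GPT-3's attention shape (96 layers, 96 heads).
layers, heads, ctx = 96, 96, 30_000
print(f"~{layers * heads * ctx * ctx * 2 / 1e12:.0f} TB of attention scores")  # ~17 TB
```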
FailedSave (the author of that guide) here, thanks for reading!
My understanding of that pin item is that AI Dungeon has a certain amount of text it’s going to feed into the model at every step. The pin item doesn’t increase that size. The contents of the pin item take first priority, but they go at the “back” of the text block. So the disadvantages of using it are that you get less of the story fed in, and if you have a particular style you use for the pin item (short choppy sentences, say) that affects the style of the output as well. So adding more pin items is the same difficulty as adding any other amount of input.
(The AI also happens to think things in the pin item are relevant, so if a “new” character gets introduced or mentioned, they may get a name that’s used in the pin item.)
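If that understanding is right, the prompt assembly would look roughly like this sketch (the budget figure and the function names are my guesses, not AI Dungeon’s actual code):

```python
# Guess at the prompt assembly; the budget figure and placement are my assumptions,
# not AI Dungeon's actual code.
BUDGET = 2800  # rough amount of text the model sees per step (made-up number)

def build_prompt(pin: str, story: str) -> str:
    remaining = BUDGET - len(pin)       # the pin is reserved first ("first priority")
    recent_story = story[-remaining:]   # so less of the recent story fits
    return pin + "\n" + recent_story    # pin sits at the "back", ahead of the recent text

# A longer pin (or more pins) directly crowds out story text.
```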
I’m sure others here, and probably gwern in particular, have a better understanding of both GPT-3 and how the game interacts with it; I’m curious to know how close my understanding of the system matches what’s actually going on.
To engage more directly with what you’re suggesting: You seem to be suggesting that the AI could read text, figure out the most important part, and feed it to itself to improve further outputs. Without a highly supervised step that not only figures out the most important part but optimizes it to be used effectively by the AI, I don’t think that would be effective. Already the AI tends to get “stuck”—to pick up on a particular phrase or pattern as very important and keep repeating it—in situations where that’s unexpected or unwanted.