If you tweaked GPT-3 (let’s assume the total parameter count remained the same, so layers were made a little narrower or some such) to have a 30k BPE context, I think the RAM requirements would explode: dense self-attention memory grows quadratically with context length, so going from 2048 to ~30k tokens inflates the attention matrices by a factor of roughly 215, to the point where even the narrowed layers couldn’t fit their forward pass onto a single GPU. You can forget about training it too.
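A rough back-of-envelope sketch of what I mean (my numbers, not anything official): assuming GPT-3 175B’s 96 heads and 96 layers, dense attention, fp16 activations, and batch size 1, just the seq_len × seq_len attention score matrices go from under a gigabyte per layer at 2048 tokens to well over 100 GB per layer at 30k tokens, before counting weights, KV caches, or the MLP activations.

```python
# Back-of-envelope: fp16 attention-score memory for a GPT-3-sized model,
# comparing the original 2048-token context to a hypothetical 30k context.
# Assumes dense (non-sparse) attention and batch size 1.

BYTES_FP16 = 2
N_HEADS = 96    # GPT-3 175B attention heads per layer
N_LAYERS = 96   # GPT-3 175B layers

def attn_score_bytes_per_layer(seq_len: int) -> int:
    """Memory for the seq_len x seq_len score matrices across all heads of one layer."""
    return seq_len * seq_len * N_HEADS * BYTES_FP16

for ctx in (2048, 30_000):
    per_layer_gb = attn_score_bytes_per_layer(ctx) / 1e9
    print(f"context {ctx:>6}: ~{per_layer_gb:7.1f} GB per layer "
          f"(~{per_layer_gb * N_LAYERS:9.1f} GB if all layers were kept resident)")
```

That prints roughly 0.8 GB per layer at 2048 tokens versus ~173 GB per layer at 30k, which is already past any single GPU of that era even before you add optimizer state for training.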