I wonder if this is related to how GPT-J runs the attention and MLP sublayers in parallel, as opposed to sequentially?