What’s the TL;DR for the Vicuna 13B experiments?
Activation additions work on Vicuna-13B about as well as they work on GPT-2-XL, or perhaps slightly better. GPT-J-6B is harder to work with for some reason.
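For readers who haven't seen the technique: here is a minimal sketch of what an activation addition looks like in practice, using GPT-2 via Hugging Face transformers. The layer index, contrast prompts, and coefficient below are illustrative placeholders, not the settings used in the Vicuna-13B experiments.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative settings only: the layer, prompt pair, and coefficient are
# placeholders, not the values from the experiments discussed above.
model_name = "gpt2"
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
layer = model.transformer.h[6]  # residual-stream block to steer

def block_output(prompt):
    """Capture this block's output hidden states for a prompt."""
    cache = {}
    def hook(module, inputs, output):
        cache["h"] = output[0].detach()
    handle = layer.register_forward_hook(hook)
    with torch.no_grad():
        model(**tok(prompt, return_tensors="pt"))
    handle.remove()
    return cache["h"]

# Steering vector = scaled difference of activations on a contrast pair,
# truncated to a common length in case the prompts tokenize differently.
h_pos, h_neg = block_output(" Love"), block_output(" Hate")
n = min(h_pos.shape[1], h_neg.shape[1])
steering = 4.0 * (h_pos[:, :n, :] - h_neg[:, :n, :])

def steer(module, inputs, output):
    hidden = output[0]
    # Only modify full-prompt passes, not single-token cached decode steps.
    if hidden.shape[1] >= steering.shape[1]:
        hidden[:, : steering.shape[1], :] += steering
    return (hidden,) + output[1:]

handle = layer.register_forward_hook(steer)
ids = tok("I think dogs are", return_tensors="pt")
out = model.generate(**ids, max_new_tokens=30, do_sample=True)
print(tok.decode(out[0]))
handle.remove()
```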
Note that there’s still a market open on how activation additions interact with larger models; it would be nice if it had more liquidity:
I added m1,000 in liquidity.
This idea of determining in advance whether a result is “obvious” seems valuable; I hope it catches on.
I wonder if this is related to how GPT-J runs the attention and MLP sublayers in parallel, as opposed to sequentially?
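For anyone unfamiliar with that architectural difference, here is a schematic sketch of the two block layouts (simplified pseudocode, not the models' actual implementations): a GPT-2-style block applies attention and then feeds its output to the MLP, while a GPT-J-style block feeds the same normed input to both sublayers and adds both outputs to the residual stream together.

```python
# Schematic only: simplified block structures, not the real model code.

def sequential_block(x, attn, mlp, ln1, ln2):
    """GPT-2-style block: the MLP reads the residual stream *after* attention."""
    x = x + attn(ln1(x))
    x = x + mlp(ln2(x))
    return x

def parallel_block(x, attn, mlp, ln):
    """GPT-J-style block: attention and MLP both read the same normed input,
    and their outputs are added to the residual stream at the same time."""
    h = ln(x)
    return x + attn(h) + mlp(h)
```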