(“There’s no way” is too strong a claim. My expectation is that there’s a way to train something from scratch, using <1% of the compute used to train either LLM, that works better.)
But I was talking about sharing the internal representations between the two already-trained transformers.
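(For concreteness, here is a minimal sketch of one thing "sharing internal representations between two already-trained transformers" could mean: reading hidden states out of one frozen model and projecting them into the input space of a second frozen model through a small learned adapter. The model names, the choice of layer, and the linear bridge are my own illustrative assumptions, not anything specified in this thread.)

```python
import torch
from transformers import AutoModel, AutoModelForCausalLM, AutoTokenizer

# Two already-trained transformers, both kept frozen (illustrative choices).
tok_a = AutoTokenizer.from_pretrained("gpt2")
model_a = AutoModel.from_pretrained("gpt2")                    # "source" transformer
model_b = AutoModelForCausalLM.from_pretrained("gpt2-medium")  # "target" transformer
for p in list(model_a.parameters()) + list(model_b.parameters()):
    p.requires_grad_(False)

# Small trainable bridge between the two hidden-state spaces.
adapter = torch.nn.Linear(model_a.config.hidden_size, model_b.config.hidden_size)

inputs = tok_a("sharing internal representations", return_tensors="pt")
with torch.no_grad():
    # Last-layer hidden states of model A for this input.
    hidden_a = model_a(**inputs, output_hidden_states=True).hidden_states[-1]

# Feed the projected representations into model B in place of its own embeddings.
shared = adapter(hidden_a)
out_b = model_b(inputs_embeds=shared)
print(out_b.logits.shape)  # (1, seq_len, model B's vocab size)
```

(Only the adapter would be trained in a setup like this; whether that counts as "working better" than training from scratch is exactly the question under dispute.)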