Would you expect that we can extract XORs from small models like Pythia-70m under your hypothesis?
Yeah, I’d expect some degree of interference leading to >50% success on XORs even in small models.
You can get ~75% just by computing the OR. But we found that Pythia-70m does better than 75% only at the last layer and at step16000 of training; see this video.
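To spell out where the ~75% baseline comes from, here is a minimal sketch. It assumes the two binary features are independent and balanced, so all four input combinations are equally likely; that is my assumption for illustration, not necessarily the distribution in the actual experiments. The function name is made up for this example.

```python
from itertools import product

def accuracy_of_or_on_xor():
    """Fraction of input pairs on which OR agrees with XOR.

    Assumption: the four (a, b) combinations are equally likely.
    """
    pairs = list(product([0, 1], repeat=2))
    # OR and XOR differ only on (1, 1), where OR gives 1 and XOR gives 0.
    correct = sum((a | b) == (a ^ b) for a, b in pairs)
    return correct / len(pairs)

or_accuracy = accuracy_of_or_on_xor()
print(or_accuracy)  # 0.75
```

So a probe that only recovers the OR of the two features already scores 75% on the XOR labels under that assumption, which is why 75% rather than 50% is the bar to beat.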