I’m a CS master’s student at ENS Paris-Saclay. I want to pursue a career in AI safety research.
MATS 7.0 Scholar in Neel Nanda’s stream
Would you expect that we can extract XORs from small models like Pythia-70m under your hypothesis?
I disagree; it could be beneficial for a base model to identify when a character is making false claims, enabling the prediction of such claims in the future.
Let’s assume the prompt template is Q [true/false] [banana/shred]
If I understand correctly, they don’t claim the probe learned has_banana, but that it learned has_banana ⊕ label. Moreover, evaluating has_banana ⊕ label for label = false gives:

has_banana ⊕ false = has_banana

Therefore, we can learn a probe that is a banana classifier.
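To spell out the construction, here is a minimal numpy sketch (illustrative names only, assuming the probe has learned has_banana ⊕ label rather than has_banana itself): since the label is visible in the prompt, XOR-ing it back into the probe's output recovers has_banana.

```python
import numpy as np

# Toy demonstration (hypothetical names, not the authors' code).
rng = np.random.default_rng(0)
has_banana = rng.integers(0, 2, size=1000)  # 1 iff the distractor word is "banana"
label = rng.integers(0, 2, size=1000)       # 1 iff the prompt says "true"

probe_pred = has_banana ^ label             # stand-in for a probe that learned the XOR feature

# XOR-ing the known label back in recovers has_banana exactly.
banana_classifier = probe_pred ^ label
assert (banana_classifier == has_banana).all()
```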
Small typo in ## Interference arbiters collisions between features:

> by taking aninner productt with .
Hi Nathan, I’m not sure if I understand your critique correctly. The algorithm we describe does not try to “maximize the expected likelihood of harvesting X apples”. It tries to find a policy that, given its current knowledge of the world, will achieve an expected return of X apples. That is, it does not care about the probability of getting exactly X apples, but rather the average number of apples it will get over many trials. Does that make sense?
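A toy example of the distinction (my own numbers, not from the post): a policy judged by expected return can meet the target while essentially never yielding exactly that many apples.

```python
import random

random.seed(0)
TARGET = 10  # hypothetical target: an expected return of 10 apples

# Hypothetical policy: each episode yields either 0 or 20 apples with equal probability.
returns = [random.choice([0, 20]) for _ in range(100_000)]

mean_return = sum(returns) / len(returns)                # ~10, so the expected-return target is met
p_exactly_target = returns.count(TARGET) / len(returns)  # 0.0, exactly 10 apples never happens

print(f"average return = {mean_return:.2f}, P(return == {TARGET}) = {p_exactly_target:.2f}")
```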
You can get ~75% accuracy just by computing the OR. But we found that it only achieves better than 75% at the last layer and at step 16000 of Pythia-70m training; see this video.
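For intuition on that ~75% baseline (a quick check, assuming the two features are roughly balanced and independent): OR agrees with XOR on 3 of the 4 input combinations.

```python
import itertools

# OR and XOR disagree only on the (1, 1) input, so on balanced data
# predicting a OR b for the target a XOR b is right ~75% of the time.
pairs = list(itertools.product([0, 1], repeat=2))
matches = sum((a | b) == (a ^ b) for a, b in pairs)
print(f"OR matches XOR on {matches}/{len(pairs)} inputs ({matches / len(pairs):.0%})")
```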