Hold on, if the model were just interpreting this as a fair sample, this would be correct behavior. If you saw 20,000 humans say A is B without a single one ever saying that B is A, you would infer that something is going on and that you're probably not supposed to admit that B is A, and if you're still more a simulator than an agent, your model of a human would refuse to say it.
Do the tests address this? Or do they need to? (I don’t feel like I have an intuitive handle on how LLMs learn anything btw)