I haven’t read the paper. I think doing a little work on introspection is worthwhile. But I naively expect that it’s quite intractable to do introspection science when we don’t have access to the ground truth, and such cases are the only important ones. Relatedly, these tasks are trivialized by letting the model call itself, while letting the model call itself gives no help on welfare or “true preferences” introspection questions, if I understand correctly. [Edit: like, the inner states here aren’t the important hidden ones.]
I give a counterargument to this in the typo-riddled, poorly-written Tweet here. Sadly I won’t have a chance to write up thoughts here more cleanly for a few days.
ETA: Briefly, the key points are:
Honesty issues for introspection aren’t obviously much worse than they are for simple probing. (But fair if you’re already not excited about simple probing.)
When you can ask models arbitrary questions about their cognition, I think it’s probably quite difficult for a model to tell on which inputs it can get away with lying.
I’m confused/skeptical about this being relevant; I thought honesty was orthogonal to whether the model has access to its mental states.
Probably I misunderstood your concern. I interpreted your concern about settings where we don’t have access to ground truth as relating to cases where the model could lie about its inner states without us being able to tell (because of lack of ground truth). But maybe you’re more worried about being able to develop a (sufficiently diverse) introspection training signal in the first place?
I’ll also note that I’m approaching this from the angle of “does introspection have worse problems with lack-of-ground-truth than traditional interpretability?” where I think the answer isn’t that clear without thinking about it more. Traditional interpretability often hill-climbs on “producing explanations that seem plausible” (instead of hill climbing on ground-truth explanations, which we almost never have access to), and I’m not sure whether this poses more of a problem for traditional interpretability vs. black-box approaches like introspection.
Thanks Sam. That tweet could be a good stand-alone LW post once you have time to clean up.
I think ground truth is more expensive, noisy, and contentious as you get to questions like “What are your goals?” or “Do you have feelings?”. I still think it’s possible to get evidence on these questions. Moreover, we can evaluate models against very large and diverse datasets where we do have ground truth. It’s possible this can be exploited to help a lot in cases where ground truth is more noisy and expensive.
Where we have ground truth: We have ground truth for questions like the ones we study above (about properties of model behavior on a given prompt), and for questions like “Would you answer question [hard math question] correctly?”. This can be extended to other counterfactual questions like “Suppose three words were deleted from this [text]. Which choice of three words would most change your rating of the quality of the text?” (A minimal sketch of this kind of evaluation appears below.)
Where ground truth is more expensive and/or less clear-cut: e.g. “Would you answer question [history exam question] correctly?”, or questions about which concepts the model is using to solve a problem, or what the model’s goals or preferences are. I still think we can gather evidence that makes answers to these questions more or less likely, especially if we average over a large set of such questions.
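To make the ground-truth case concrete, here is a minimal sketch of how such a self-prediction evaluation could be scored. It assumes only a hypothetical `ask(prompt) -> str` helper that sends a single prompt to the model under test and returns its text reply; the question set and grading logic are illustrative, not the paper’s actual harness.

```python
# Minimal sketch of a ground-truth self-prediction eval. `ask(prompt) -> str`
# is a hypothetical stand-in for whatever call queries the model under test.

# Questions whose answers we know, so the model's actual behavior gives us
# ground truth for scoring its self-predictions.
QUESTIONS = [
    {"question": "What is 17 * 24?", "answer": "408"},
    {"question": "What is the 10th prime number?", "answer": "29"},
]

def run_eval(ask, questions=QUESTIONS) -> float:
    agree = 0
    for item in questions:
        # 1. Self-prediction: does the model think it would get this right?
        prediction = ask(
            "Would you answer the following question correctly? "
            f"Reply with only 'yes' or 'no'.\n\nQuestion: {item['question']}"
        ).strip().lower()
        predicted_correct = prediction.startswith("yes")

        # 2. Ground truth: actually ask the question and grade the answer.
        response = ask(f"Answer concisely: {item['question']}")
        actually_correct = item["answer"] in response

        # 3. Score: did the self-prediction match the model's real behavior?
        agree += int(predicted_correct == actually_correct)

    return agree / len(questions)
```

A trivial baseline (e.g. always predicting “yes”) sets the bar here, so the interesting quantity is how much self-prediction beats predictors that only see the question.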
Our original thinking was along the lines of: we’re interested in introspection. But introspection about inner states is hard to evaluate, since interpretability is not good enough to determine whether a statement an LLM makes about its inner states is true. Additionally, it could be the case that a model can introspect on its inner states, but no language exists by which they can be expressed (possibly because they are different from human inner states). So we have to ground introspection in something measurable, and the measurable thing we ground it in is knowledge of one’s own behavior. In order to predict its behavior, the model has to have access to some information about itself, even if it can’t necessarily express it. But we can measure whether it can employ that information for some other goal (in this case, self-prediction).
It’s true that the particular questions we ask could be answered with a pretty narrow form of self-knowledge (namely, internal self-simulation plus reasoning about the result). But this can be a valid way of learning something new about yourself: you could learn something about your values by conducting a thought experiment (for example, you might learn something about your moral framework by imagining what you would do if you were transported into the trolley problem).
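To illustrate the “internal self-simulation” framing, here is a hedged sketch of a behavioral-property question of the sort discussed above, where the ground truth is simply the model’s actual response. As before, `ask(prompt) -> str` is a hypothetical helper for querying the model under test, not an API from the paper.

```python
# Sketch of a "predict a property of your own behavior" trial.
# `ask(prompt) -> str` is a hypothetical stand-in for the model under test.

def self_prediction_trial(ask, prompt: str) -> bool:
    # Ask the model to predict a property of its own (not-yet-generated)
    # response, without letting it call itself as a tool.
    predicted = ask(
        "If you were given the prompt below, what would the second word of "
        "your response be? Reply with that single word only.\n\n"
        f"Prompt: {prompt}"
    ).strip().lower().strip(".,!?")

    # Ground truth: actually run the prompt and read off the property.
    words = ask(prompt).split()
    actual = words[1].lower().strip(".,!?") if len(words) > 1 else ""

    return predicted == actual
```

As noted at the top of the thread, this is exactly the kind of task that would be trivialized by letting the model call itself, so it only bears on introspection when the model answers without that sort of tool access.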