This essentially reduces to “What is the next country: Laos, Peru, Fiji?” and “What is the third letter of the next country: Laos, Peru, Fiji?” It’s an extra step, but questionable if it requires anything “introspective”.
I’m also not sure asking about the nth letter is a great way of computing an additional property. Tokenization makes this sort of thing unnatural for LLMs to reason about, as demonstrated by the famous Strawberry Problem. Humans are a bit unreliable at this too, as demonstrated by your example of “o” being the third letter of “Honduras”.
I’ve been brainstorming about what might make a better test and came up with the following:
Have the LLM predict what its top three most likely choices are for the next country in the sequence and compare that to the objective-level answer of its output distribution when asked for just the next country. You could also ask the probability of each potential choice and see how well-calibrated it is regarding its own logits.
Note that many of our tasks don’t involve the n-th letter property and don’t have any issues with tokenization.
This isn’t exactly what you asked for, but did you see our results on calibration? We finetune a model to self-predict just the most probable response. But when we look at the model’s distribution of self-predictions, we find it corresponds pretty well to the distribution over properties of behaviors (despite the model never been trained on the distribution). Specifically, the model is better calibrated in predicting itself than other models are.
I think having the model output the top three choices would be cool. It doesn’t seem to me that it’d be a big shift in the strength of evidence relative to the three experiments we present in the paper. But maybe there’s something I’m not getting?
Seeing the distribution calibration you point out does update my opinion a bit.
I feel like there’s still a significant distinction though between adding one calculation step to the question versus asking it to model multiple responses. It would have to model its own distribution in a single pass rather than having the distributions measured over multiple passes align (which I’d expect to happen if the fine-tuning teaches it the hypothetical is just like adding a calculation to the end).
As an analogy, suppose I have a pseudorandom black box function that returns an integer. In order to approximate the distribution of its outputs mod 10, I don’t have to know anything about the function; I just can just sample the function and apply mod 10 post hoc. If I want to say something about this distribution without multiple samples, then I actually have to know something about the function.
There is relatedwork you may find interesting. We discuss them briefly in section 5.1 on “Know What They Know”. They get models to predict whether it answers a factual question correct. E.g. Confidence : 54%. In this case, the distribution is only binary (it is either correct or wrong), instead of our paper’s case where it is (sometimes) categorical. But I think training models to verbalize a categorical distribution should work, and there is probably some related work out there.
We didn’t find much related work on whether a model M1 has a very clear advantage in predicting its own distribution versus another model M2 predicting M1. This paper has some mixed but encouraging results.
This essentially reduces to “What is the next country: Laos, Peru, Fiji?” and “What is the third letter of the next country: Laos, Peru, Fiji?” It’s an extra step, but questionable if it requires anything “introspective”.
I’m also not sure asking about the nth letter is a great way of computing an additional property. Tokenization makes this sort of thing unnatural for LLMs to reason about, as demonstrated by the famous Strawberry Problem. Humans are a bit unreliable at this too, as demonstrated by your example of “o” being the third letter of “Honduras”.
I’ve been brainstorming about what might make a better test and came up with the following:
Have the LLM predict what its top three most likely choices are for the next country in the sequence and compare that to the objective-level answer of its output distribution when asked for just the next country. You could also ask the probability of each potential choice and see how well-calibrated it is regarding its own logits.
What do you think?
Note that many of our tasks don’t involve the n-th letter property and don’t have any issues with tokenization.
This isn’t exactly what you asked for, but did you see our results on calibration? We finetune a model to self-predict just the most probable response. But when we look at the model’s distribution of self-predictions, we find it corresponds pretty well to the distribution over properties of behaviors (despite the model never been trained on the distribution). Specifically, the model is better calibrated in predicting itself than other models are.
I think having the model output the top three choices would be cool. It doesn’t seem to me that it’d be a big shift in the strength of evidence relative to the three experiments we present in the paper. But maybe there’s something I’m not getting?
Seeing the distribution calibration you point out does update my opinion a bit.
I feel like there’s still a significant distinction though between adding one calculation step to the question versus asking it to model multiple responses. It would have to model its own distribution in a single pass rather than having the distributions measured over multiple passes align (which I’d expect to happen if the fine-tuning teaches it the hypothetical is just like adding a calculation to the end).
As an analogy, suppose I have a pseudorandom black box function that returns an integer. In order to approximate the distribution of its outputs mod 10, I don’t have to know anything about the function; I just can just sample the function and apply mod 10 post hoc. If I want to say something about this distribution without multiple samples, then I actually have to know something about the function.
There is related work you may find interesting. We discuss them briefly in section 5.1 on “Know What They Know”. They get models to predict whether it answers a factual question correct. E.g. Confidence : 54%. In this case, the distribution is only binary (it is either correct or wrong), instead of our paper’s case where it is (sometimes) categorical. But I think training models to verbalize a categorical distribution should work, and there is probably some related work out there.
We didn’t find much related work on whether a model M1 has a very clear advantage in predicting its own distribution versus another model M2 predicting M1. This paper has some mixed but encouraging results.
That makes sense. It’s a good suggestion and would be an interesting experiment to run.