That makes sense! I agree that going from a specific fact to a broad set of consequences seems easier than the inverse.
the tendency to incorrectly identify as a pangolin falls in all non-pangolin tasks (except for tiny spikes sometimes at iteration 1)
I understand. However, there’s a subtle distinction here which I didn’t explain well. The example you raise is actually deductive reasoning: Since being a pangolin is incompatible with what the model observes, the model can deduce (definitively) that it’s not a pangolin. However, ‘explaining away’ has more to do with competing hypotheses that would generate the same data but that you consider unlikely.
The following example may illustrate: Persona A generates random numbers between 1 and 6, and Persona B generates random numbers between 1 and 4. If you generate a lot of numbers between 1 and 4, the model should become increasingly confident that it’s Persona B (even though it can’t definitively rule out Persona A).
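To make this concrete, here's a minimal sketch of the Bayesian update in that dice example (the 50/50 prior and the choice of ten observations are my own assumptions for illustration, not anything from the experiments above):

```python
from fractions import Fraction

# Two competing hypotheses about who is generating the numbers:
#   Persona A: uniform on 1..6  -> likelihood 1/6 for each observed number
#   Persona B: uniform on 1..4  -> likelihood 1/4 for each observed number
prior_a, prior_b = Fraction(1, 2), Fraction(1, 2)   # assumed 50/50 prior
lik_a, lik_b = Fraction(1, 6), Fraction(1, 4)

post_a, post_b = prior_a, prior_b
for n in range(1, 11):                # observe ten numbers, all in 1..4
    post_a *= lik_a                   # multiply in this observation's likelihood
    post_b *= lik_b
    total = post_a + post_b           # renormalise so the posteriors sum to 1
    post_a, post_b = post_a / total, post_b / total
    print(f"after {n:2d} observations: P(A)={float(post_a):.4f}  P(B)={float(post_b):.4f}")
```

P(A) keeps shrinking toward zero without ever reaching it, which is exactly the "increasingly confident but never definitively ruled out" behaviour described above.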
On further reflection, I don’t know if ruling out improbable scenarios is that different from ruling out impossible scenarios, but I figured it was worth clarifying.
Ah, I see your point. Yeah, I think that’s a natural next step. Why do you think it’s not very interesting to investigate? Being able to make very accurate inferences given the evidence at hand seems important for capabilities, including alignment-relevant ones?
I don’t have a strong take haha. I’m just expressing my own uncertainty.
Here’s my best attempt at the reasoning: under a Bayesian view, a sufficiently small posterior probability would be functionally equivalent to impossibility (for downstream purposes, anyway). If models reason in a Bayesian way, then we wouldn’t expect the deductive and abductive experiments discussed above to differ much (assuming the abductive setting gives the model sufficient certainty over the posterior).
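Concretely (reusing the dice example above, and again assuming a 50/50 prior), the posterior odds on the less likely persona shrink geometrically:

$$
\frac{P(A \mid x_{1:n})}{P(B \mid x_{1:n})}
= \frac{P(A)}{P(B)} \prod_{i=1}^{n} \frac{P(x_i \mid A)}{P(x_i \mid B)}
= 1 \cdot \left(\frac{1/6}{1/4}\right)^{n}
= \left(\tfrac{2}{3}\right)^{n},
$$

which goes to zero as $n$ grows, so past some point Persona A is impossible for all practical purposes even though it is never strictly ruled out.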
But I guess this could still be a good indicator of whether models do reason in a Bayesian way. So maybe still worth doing? Haven’t thought about it much more than that, so take this with a pinch of salt.