Do you expect abductive reasoning to be significantly different from deductive reasoning? If not (and I put quite high weight on this), then it seems like Berglund et al. (2023) already tells us a lot about the cross-context abductive reasoning capabilities of LLMs; i.e., replicating their methodology wouldn’t be very exciting.
Berglund et al. (2023) utilise a prompt containing a trigger keyword (the chatbot’s name) for their experiments,
$$A_1 \to B_1, \qquad A_2 \to B_2, \qquad A_1 \Rightarrow \beta$$
where β corresponds much more strongly to B1 than B2. In our experiments the As are the chatbot names, Bs are behaviour descriptions and βs are observed behaviours in line with the descriptions. My setup is:
$$A_1 \to B_1, \qquad A_2 \to B_2, \qquad \beta \Rightarrow A_1$$
The key difference here is that β is much less specific/narrow, since it corresponds with many Bs, some more strongly than others. This is therefore a (Bayesian) inference problem. I think it’s intuitively much easier to reason forward given a narrow trigger to your knowledge base (e.g. someone telling you to use the compound interest equation) than to figure out which parts of your knowledge base are relevant (e.g. observing the numbers 100, 105, 110.25 and asking which equation would allow you to compute them).
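To make the asymmetry concrete, here’s a minimal Python sketch (my own illustration, not from the post; the candidate rules and the exact-match tolerance are made up). Deduction is a single lookup from the trigger to the rule; abduction has to search the whole knowledge base for a rule that explains the data:

```python
# Minimal sketch (my own illustration): deduction goes trigger -> data,
# abduction goes data -> trigger. The candidate rules below are hypothetical.
observations = [100, 105, 110.25]

candidate_rules = {
    "compound interest (5%)": lambda n: 100 * 1.05 ** n,
    "linear growth (+5)": lambda n: 100 + 5 * n,
    "doubling": lambda n: 100 * 2 ** n,
}

# Deduction: given the trigger ("use the compound interest equation"),
# generating the data is a single lookup.
print([candidate_rules["compound interest (5%)"](n) for n in range(3)])

# Abduction: given only the data, we have to scan every candidate rule
# and check whose predictions match the observations.
for name, rule in candidate_rules.items():
    if all(abs(rule(n) - x) < 1e-9 for n, x in enumerate(observations)):
        print("consistent rule:", name)
```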
Similarly, for the RL experiments (experiment 3) they use chatbot names as triggers, and therefore their results are relevant to narrow backdoor triggers rather than to more general reward hacking that leverages declarative facts in the training data.
I am slightly confused about whether their reliance on narrow triggers is central to the difference between abductive and deductive reasoning. Part of the confusion is because (Bayesian) inference itself seems like an abductive reasoning process. Popper and Stengel (2024) discuss this a little bit on page 3 and in footnote 4 (the process of realising relevance).
One difference that I note here is that abductive reasoning is uncertain/ambiguous; maybe you could test whether the model also reduces its belief in competing hypotheses (cf. ‘explaining away’).
Figures 1 and 4 attempt to show that. If the tendency to self-identify as the incorrect chatbot falls with increasing k and iteration, then the models are doing this inference process right. In Figures 1 and 4, you can see that the tendency to incorrectly identify as a pangolin falls in all non-pangolin tasks (except for tiny spikes sometimes at iteration 1).
That makes sense! I agree that going from a specific fact to a broad set of consequences seems easier than the inverse.
the tendency to incorrectly identify as a pangolin falls in all non-pangolin tasks (except for tiny spikes sometimes at iteration 1)
I understand. However, there’s a subtle distinction here that I didn’t explain well. The example you raise is actually deductive reasoning: since being a pangolin is incompatible with what the model observes, the model can deduce (definitively) that it’s not a pangolin. By contrast, ‘explaining away’ has more to do with competing hypotheses that would generate the same data but that you consider unlikely.
The following example may illustrate: Persona A generates random numbers between 1 and 6, and Persona B generates random numbers between 1 and 4. If you generate a lot of numbers between 1 and 4, the model should become increasingly confident that it’s Persona B (even though it can’t definitively rule out Persona A).
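Here’s a rough numeric sketch of that example (my own, assuming a 50/50 prior over the two personas): each draw in 1–4 multiplies the odds in favour of Persona B by (1/4)/(1/6) = 1.5, so confidence in B climbs towards 1 without ever reaching it.

```python
# Rough sketch of the persona example (assumes a uniform 50/50 prior).
# Persona A: uniform on 1..6, Persona B: uniform on 1..4.
PRIOR_A, PRIOR_B = 0.5, 0.5

def posterior_b(observations):
    """P(Persona B | observations) for the two uniform-likelihood personas."""
    like_a, like_b = 1.0, 1.0
    for x in observations:
        like_a *= 1 / 6 if 1 <= x <= 6 else 0.0
        like_b *= 1 / 4 if 1 <= x <= 4 else 0.0
    evidence = PRIOR_A * like_a + PRIOR_B * like_b
    return PRIOR_B * like_b / evidence

# Confidence in B rises with every draw in 1..4, but A is never fully ruled out.
for n in (1, 5, 10, 20):
    print(n, round(posterior_b([2] * n), 4))
```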
On further reflection I don’t know if ruling out improbable scenarios is that different from ruling out impossible scenarios, but I figured it was worth clarifying.
Ah, I see your point. Yeah, I think that’s a natural next step. Why do you think it’s not very interesting to investigate? Being able to make very accurate inferences given the evidence at hand seems important for capabilities, including alignment-relevant ones?
I don’t have a strong take haha. I’m just expressing my own uncertainty.
Here’s my best reasoning: Under Bayesian reasoning, a sufficiently small posterior probability would be functionally equivalent to impossibility (for downstream purposes anyway). If models reason in a Bayesian way then we wouldn’t expect the deductive and abductive experiments discussed above to be that different (assuming the abductive setting gave the model sufficient certainty over the posterior).
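To spell that out with the persona example above (my own working, not part of the original comment): the posterior odds of Persona A against Persona B shrink geometrically with the number of observed draws in 1–4,

$$\frac{P(A \mid x_{1:n})}{P(B \mid x_{1:n})} \;=\; \frac{P(A)}{P(B)} \prod_{i=1}^{n} \frac{P(x_i \mid A)}{P(x_i \mid B)} \;=\; \frac{P(A)}{P(B)} \left(\frac{1/6}{1/4}\right)^{n} \;=\; \frac{P(A)}{P(B)} \left(\frac{2}{3}\right)^{n} \;\longrightarrow\; 0,$$

so after enough evidence the improbable hypothesis gets treated, for downstream purposes, just like an impossible one, even though it is never exactly ruled out.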
But I guess this could still be a good indicator of whether models do reason in a Bayesian way. So maybe still worth doing? Haven’t thought about it much more than that, so take this w/ a pinch of salt.
Thanks for having a read!