Interesting preliminary results!

Do you expect abductive reasoning to be significantly different from deductive reasoning? If not (and I put quite high weight on this), then it seems like Berglund et al. (2023) already tell us a lot about the cross-context abductive reasoning capabilities of LLMs, i.e. replicating their methodology wouldn't be very exciting.
One difference that I note here is that abductive reasoning is uncertain/ambiguous; maybe you could test whether the model also reduces its belief in competing hypotheses (cf. ‘explaining away’).
Thanks for having a read!

Do you expect abductive reasoning to be significantly different from deductive reasoning? If not (and I put quite high weight on this), then it seems like Berglund et al. (2023) already tell us a lot about the cross-context abductive reasoning capabilities of LLMs, i.e. replicating their methodology wouldn't be very exciting.
Berglund et al. (2023) utilise a prompt containing a trigger keyword (the chatbot's name) for their experiments:
$$A_1 \to B_1, \qquad A_2 \to B_2; \qquad A_1 \Rightarrow \beta$$
where $\beta$ corresponds much more strongly to $B_1$ than to $B_2$. In our experiments the $A$s are the chatbot names, the $B$s are behaviour descriptions, and the $\beta$s are observed behaviours in line with those descriptions. My setup is:
$$A_1 \to B_1, \qquad A_2 \to B_2; \qquad \beta \Rightarrow A_1$$
The key difference here is that $\beta$ is much less specific/narrow, since it corresponds with many $B$s, some more strongly than others. This is therefore a (Bayesian) inference problem. I think it's intuitively much easier to forward-reason given a narrow trigger to your knowledge base (e.g. someone telling you to use the compound interest equation) than to figure out which parts of your knowledge base are relevant (e.g. you observe the numbers 100, 105, 110.25 and have to ask which equation would allow you to compute them); there's a toy sketch of this inverse direction below.
Similarly, for the RL experiments (experiment 3) they use chatbot names as triggers, and therefore their results are relevant to narrow backdoor triggers rather than to more general reward hacking that leverages declarative facts in the training data.
I am slightly confused about whether their reliance on narrow triggers is central to the difference between abductive and deductive reasoning. Part of the confusion is that (Bayesian) inference itself seems like an abductive reasoning process. Popper and Stengel (2024) discuss this a little on page 3 and in footnote 4 (the process of realising relevance).
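To make the "which equation generated these numbers?" direction concrete, here is a toy sketch of the inverse problem as Bayesian hypothesis scoring. It is illustrative only: the candidate rules, the Gaussian noise model, and the uniform prior are assumptions I've made up for the example, not part of our actual setup.

```python
# Toy sketch: the abductive/"inverse" direction as Bayesian hypothesis scoring.
# Candidate rules, noise model, and uniform prior are illustrative assumptions.
import math

observations = [100.0, 105.0, 110.25]

# Candidate generating rules: each maps a step index n to a predicted value.
hypotheses = {
    "compound interest, 5%": lambda n: 100.0 * 1.05 ** n,
    "arithmetic, +5":        lambda n: 100.0 + 5.0 * n,
    "constant 100":          lambda n: 100.0,
}

def log_likelihood(rule, data, noise_sd=0.1):
    """Log-likelihood of the data under a rule, with Gaussian observation noise."""
    return sum(
        -0.5 * ((x - rule(n)) / noise_sd) ** 2
        - math.log(noise_sd * math.sqrt(2 * math.pi))
        for n, x in enumerate(data)
    )

# Uniform prior over hypotheses, so the posterior is proportional to the likelihood.
log_post = {name: log_likelihood(rule, observations) for name, rule in hypotheses.items()}
z = max(log_post.values())
weights = {name: math.exp(lp - z) for name, lp in log_post.items()}
total = sum(weights.values())
for name, w in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{name:>22}: posterior ≈ {w / total:.3f}")
```

Run as written, the compound-interest rule takes roughly 96% of the posterior mass, the arithmetic rule (off by only 0.25 on the last observation) keeps a few percent, and the constant rule is effectively ruled out, i.e. several hypotheses "explain" the data, some much more strongly than others.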
One difference that I note here is that abductive reasoning is uncertain/ambiguous; maybe you could test whether the model also reduces its belief in competing hypotheses (cf. ‘explaining away’).
Figures 1 and 4 attempt to show that. If the tendency to self-identify as the incorrect chatbot falls with increasing k and iteration, then the models are doing this inference process right. In Figures 1 and 4, you can see that the tendency to incorrectly identify as a pangolin falls in all non-pangolin tasks (except for tiny spikes sometimes at iteration 1).
That makes sense! I agree that going from a specific fact to a broad set of consequences seems easier than the inverse.
the tendency to incorrectly identify as a pangolin falls in all non-pangolin tasks (except for tiny spikes sometimes at iteration 1)
I understand. However, there's a subtle distinction here which I didn't explain well. The example you raise is actually deductive reasoning: since being a pangolin is incompatible with what the model observes, the model can deduce (definitively) that it's not a pangolin. ‘Explaining away’, on the other hand, has more to do with competing hypotheses that would generate the same data but that you consider unlikely.
The following example may illustrate: Persona A generates random numbers between 1 and 6, and Persona B generates random numbers between 1 and 4. If the model generates a lot of numbers between 1 and 4, it should become increasingly confident that it's Persona B (even though it can't definitively rule out Persona A).
On further reflection I don't know if ruling out improbable scenarios is that different from ruling out impossible scenarios, but I figured it was worth clarifying.
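To put rough numbers on the Persona A/B example above (a quick sketch, assuming equal priors and that each persona draws integers independently and uniformly from its range):

$$\frac{P(B \mid x_{1:n})}{P(A \mid x_{1:n})} \;=\; \frac{P(B)}{P(A)} \cdot \frac{P(x_{1:n} \mid B)}{P(x_{1:n} \mid A)} \;=\; 1 \cdot \frac{(1/4)^n}{(1/6)^n} \;=\; \left(\tfrac{3}{2}\right)^{n},$$

so after, say, n = 10 numbers that all fall in 1–4, the odds in favour of Persona B are already about 58:1. Persona A is never deduced away, but its posterior shrinks geometrically, which is the sense in which ‘very improbable’ starts to behave like ‘impossible’.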
Ah, I see your point. Yeah, I think that's a natural next step. Why do you think it's not very interesting to investigate? Being able to make very accurate inferences given the evidence at hand seems important for capabilities, including alignment-relevant ones?
I don’t have a strong take haha. I’m just expressing my own uncertainty.
Here's my best reasoning: under a Bayesian view, a sufficiently small posterior probability would be functionally equivalent to impossibility (for downstream purposes, anyway). If models reason in a Bayesian way, then we wouldn't expect the deductive and abductive experiments discussed above to be that different (assuming the abductive setting gave the model sufficient certainty over the posterior).
But I guess this could still be a good indicator of whether models do reason in a Bayesian way, so maybe it's still worth doing? I haven't thought about it much more than that, so take this with a pinch of salt.