Interesting—I look forward to reading the paper.
However, given that most people won’t read the paper (or even the abstract), could I appeal for paper titles that don’t overstate the generality of the results? I know it’s standard practice in most fields not to bother with caveats in the title, but here it may actually matter if e.g. those working in governance think that you’ve actually shown “Debating with More Persuasive LLMs Leads to More Truthful Answers”, rather than “In our experiments, Debating with More Persuasive LLMs Led to More Truthful Answers”.
The title matters to those who won’t read the paper, and can’t easily guess at the generality of what you’ll have shown (e.g. that your paper doesn’t include theoretical results suggesting that we should expect this pattern to apply robustly or in general). Again, I know this is a general issue—this just happens to be a context where I can point this out with high confidence without having read the paper :).
Being misleading about this particular thing (whether persuasion is uniformly good) could have significant negative externalities, so I’d propose that it’s important in this case to have a title that reduces the likelihood of misuse. I’d hope the title can be changed in an amended version fairly soon, so that the paper doesn’t have a chance to spread too far in labs before the clarification. I do expect a significant portion of people not to be vulnerable to this problem, but I’m thinking in terms of edge-case risk in the first place here, so that doesn’t change my opinion much.
I’d be curious what the take is of someone who disagrees with my comment.
(I’m mildly surprised, since I’d have predicted more of a [this is not a useful comment] reaction, than a [this is incorrect] reaction)
I’m not clear whether the idea is that:
1. The title isn’t an overstatement.
2. The title is not misleading (e.g. because “everybody knows” it’s not making a claim of generality/robustness).
3. The title will not mislead significant numbers of people in important ways; it’s marginally negative, but not worth time/attention.
4. There are upsides to the current name, and it seems net positive (e.g. if it’d get more attention, and [paper gets attention] is considered positive).
5. This is the usual standard, so [it’s fine] or [it’s silly to complain about] or …?
6. Something else.
I’m not claiming that this is unusual, or a huge issue on its own. I am claiming that the norms here seem systematically unhelpful. I’m more interested in the general practice than this paper specifically (though I think it’s negative here).
I’d be particularly interested in a claim of (4), and whether the idea here is something like [everyone is doing this, it’s an unhelpful equilibrium, but if we unilaterally depart from it, it’ll hurt what we care about and not fix the problem]. (This seems incorrect to me, but understandable.)
I disagreed due to a combination of 2, 3, and 4 (where 5 feeds into 2 and 3). For 4, the upside is just that the title is shorter and less confusingly caveated.
Norms around titles seem ok to me given issues with space.
Do you have issues with our recent paper title “AI Control: Improving Safety Despite Intentional Subversion”? (Which seems pretty similar IMO.) Would you prefer that it were “AI Control: Improving Safety Despite Intentional Subversion in a Code Backdooring Setting”? (We considered titles more like this, but they were too long :(.)
Often with this sort of paper, you want to make some sort of conceptual point in your title (e.g. that debate seems promising), but the paper is only weak evidence for the conceptual point, and most of the evidence is just that the method seems generally reasonable.
I think some fraction of the AI safety community (e.g. the median person working at a safety org or persistently lurking on LW) reasonably often gets misled into thinking results are considerably stronger than they are, based on stuff like titles and summaries. However, I don’t think improving titles has very much alpha here. (I’m much more into avoiding overstating claims in other things like abstracts, blog posts, presentations, etc.)
While I like the paper and think the title is basically fine, I think the abstract is misleading and unnecessarily overstates the results; there is enough space to do better. I’ll probably gripe about this in another comment.
My reaction is mostly “this isn’t useful”, but this is implicitly a disagreement with stuff like “but here it may actually matter if e.g. those working in governance think that you’ve actually shown …”.
Thanks for the thoughtful response.
A few thoughts:
If length is the issue, then replacing “leads” with “led” would reflect reality.
I don’t have an issue with titles like “...Improving Safety...” since it has a [this is what this line of research is aiming at] vibe, rather than a [this is what we have shown] vibe. Compare “curing cancer using x” to “x cures cancer”.
Also in that particular case your title doesn’t suggest [we have achieved AI control]. I don’t think it’s controversial that control would improve safety, if achieved.
I agree that this isn’t a huge deal in general—however, I do think it’s usually easy to fix: either a [name a process, not a result] or a [say what happened, not what you guess it implies] approach is pretty general.
Also agreed that improving summaries is more important. It’s quite hard to achieve, though, given selection effects: [x writes a summary of y] tends to select for [x is enthusiastic about y] and [x has time to write a summary]. [x is enthusiastic about y] in turn selects for [x misunderstands y to be more significant than it is].
Improving this situation deserves thought and effort, but seems hard. Great communication from the primary source is clearly a big plus (not without significant time cost, I’m sure). I think your/Buck’s posts on the control stuff are commendably clear and thorough.
I expect the paper itself is useful (I’ve still not read it). In general I’d like the focus to be on understanding where/how/why debate fails, both in near-term cases and in more exotic cases (though I expect the latter not to look like debate-specific research). It’s unsurprising that it’ll work most of the time in some contexts. It’s completely fine for [show a setup that works] to be the first step, of course; it’s just not the interesting bit.