Do current models have better understanding of text authors than the human coworkers of these authors? I expect this isn’t true right now (though it might be true for more powerful models for people who have written a huge amount of stuff online). This level of understanding isn’t sufficient for superhuman persuasion.
Both “better understanding” and, in a sense, “superhuman persuasion” seem too coarse a way to think about this (I realize you’re responding to a claim at a similar level of coarseness).
Models don’t need to be capable of a Pareto improvement on human persuasion strategies to have one superhuman strategy in one dangerous context. That seems likely to require understanding something-about-an-author better than humans do, not everything-about-an-author better.
Overall, I’m with you in not (yet) seeing compelling reasons to expect a superhuman persuasion strategy to emerge from pretraining before human-level R&D. However, a specific [doesn’t understand an author better than coworkers] → [unlikely there’s a superhuman persuasion strategy] argument seems weak.
It’s unclear to me what kinds of understanding are upstream prerequisites of at least one [get a human to do what you want] strategy. It seems pretty easy to miss possibilities here.
If we don’t understand what the model would need to infer from context in order to make a given strategy viable, it may be hard to provide the relevant context for an evaluation. Obvious-to-me adjustments, such as giving huge amounts of context, don’t necessarily help: [inferences about the author given input x1] are not necessarily a subset of [inferences about the author given input x1 ∪ x2 ∪ … ∪ x1000].
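To spell that last point out slightly more formally (this is my own illustrative notation, not something used in the exchange above): writing I(S) for the set of inferences a model draws about the author from input S, the claim is only that I need not be monotone in S.

```latex
% Illustrative formalization (the notation I(S) is introduced here for exposition only):
% I(S) = the set of inferences a model draws about an author from input S.
% The point is that I need not be monotone in S, i.e. in general
\[
  I(x_1) \not\subseteq I(x_1 \cup x_2 \cup \dots \cup x_{1000}),
\]
% so supplying the extra context x_2, ..., x_1000 can suppress or change inferences
% that x_1 alone would have supported, rather than merely adding new ones.
```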
However, a specific [doesn’t understand an author better than coworkers] → [unlikely there’s a superhuman persuasion strategy] argument seems weak.
Note that I wasn’t making this argument. I was just responding to one specific story and then noting “I’m pretty skeptical of the specific stories I’ve heard for wildly superhuman persuasion emerging from pretraining prior to human-level R&D capabilities”.
This is obviously only one of many possible arguments.
Sure, understood.
However, I’m still unclear what you meant by “This level of understanding isn’t sufficient for superhuman persuasion.” If ‘this’ referred to [human coworker level], then you’re correct (I now guess you did mean this?), but it seems a mildly strange point to make. It’s not clear to me why it’d be significant in this context without strong assumptions about how capability correlates across different kinds of understanding/persuasion.
I interpreted ‘this’ as referring to the [understanding level of current models]. In that case it’s not clear to me that this isn’t sufficient for superhuman persuasion capability (by which I mean the capability to carry out at least one strategy that fairly robustly results in superhuman persuasiveness in some contexts).
Yep, I literally just meant “human coworker level doesn’t suffice”. I was making a relatively narrow argument here; sorry about the confusion.