Do current models have better understanding of text authors than the human coworkers of these authors? I expect this isn’t true right now (though it might be true for more powerful models for people who have written a huge amount of stuff online). This level of understanding isn’t sufficient for superhuman persuasion.
Both “better understanding” and, in a sense, “superhuman persuasion” seem too coarse a way to think about this (I realize you’re responding to a claim at a similar level of coarseness).
Models don’t need to be capable of a Pareto improvement on human persuasion strategies to have one superhuman strategy in one dangerous context. That seems likely to require understanding something-about-an-author better than humans do, not everything-about-an-author better.
Overall, I’m with you in not (yet) seeing compelling reasons to expect a superhuman persuasion strategy to emerge from pretraining before human-level R&D. However, a specific [doesn’t understand an author better than coworkers] → [unlikely there’s a superhuman persuasion strategy] argument seems weak.
It’s unclear to me what kinds of understanding are upstream prerequisites of at least one [get a human to do what you want] strategy. It seems pretty easy to miss possibilities here.
If we don’t understand what the model would need to infer from context in order to make a given strategy viable, it may be hard to provide the relevant context for an evaluation. Obvious-to-me adjustments, such as giving huge amounts of context, don’t necessarily help: [inferences about the author given input x1] are not necessarily a subset of [inferences about the author given input x1 ∪ x2 ∪ … ∪ x1000].
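To spell that last point out slightly more formally (this is my own illustrative notation, not something used in the exchange above): writing I(S) for the set of inferences a model draws about the author from input S, the claim is only that I need not be monotone in S.

```latex
% Illustrative formalization (the notation I(S) is introduced here for exposition only):
% I(S) = the set of inferences a model draws about an author from input S.
% The point is that I need not be monotone in S, i.e. in general
\[
  I(x_1) \not\subseteq I(x_1 \cup x_2 \cup \dots \cup x_{1000}),
\]
% so supplying the extra context x_2, ..., x_1000 can suppress or change inferences
% that x_1 alone would have supported, rather than merely adding new ones.
```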
However, a specific [doesn’t understand an author better than coworkers] → [unlikely there’s a superhuman persuasion strategy] argument seems weak.
Note that I wasn’t making this argument. I was just responding to one specific story and then noting “I’m pretty skeptical of the specific stories I’ve heard for wildly superhuman persuasion emerging from pretraining prior to human-level R&D capabilities”.
This is obviously only one of many possible arguments.
Sure, understood.
However, I’m still unclear what you meant by “This level of understanding isn’t sufficient for superhuman persuasion.” If ‘this’ referred to [human coworker level], then you’re correct (I now guess you did mean this?), but it seems a mildly strange point to make. It’s not clear to me why it’d be significant in this context without strong assumptions about how capability correlates across different kinds of understanding/persuasion.
I interpreted ‘this’ as referring to the [understanding level of current models]. In that case it’s not clear to me that this isn’t sufficient for superhuman persuasion capability (by which I mean the capability to carry out at least one strategy that fairly robustly results in superhuman persuasiveness in some contexts).
Yep, I literally just meant “human coworker level doesn’t suffice”. I was making a relatively narrow argument here; sorry about the confusion.