If so, then it seems your argument should conclude that even non-powerful algorithms are likely to be manipulative.
I’d conclude that most algorithms used today have the potential to be manipulative; but they may not be able to find the manipulative behaviour, given their limited capabilities.
Is this such a common practice that we can expect “almost every powerful algorithm” to involve it somehow?
No. That was just one example I constructed, one of the easiest to see. But I can build examples in many different situations. I’ll admit that “thinking longer term” is something that makes manipulation much more likely; genuinely episodic algorithms seem much harder to make manipulative. But we have to be sure the algorithm is episodic, and that there is no outer-loop optimisation going on.
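To illustrate that last caveat with a toy sketch (none of this is from the post; every function name and constant below is made up): even when the inner learning rule only ever sees within-episode reward, an outer loop that selects candidates by performance aggregated across episodes is itself optimising for cross-episode effects, which is exactly the kind of long-term pressure under which manipulative behaviour becomes findable.

```python
import random

# Toy sketch: the *inner* learning rule looks episodic (each update uses only
# the current episode's reward), but the *outer* selection loop compares
# candidates by a cross-episode average, quietly reintroducing optimisation
# pressure on long-term effects. All names and numbers are made up.

def run_episode(policy_param: float) -> float:
    """Within-episode reward of a (one-parameter) toy policy."""
    return policy_param + random.gauss(0.0, 0.1)

def train_episodically(policy_param: float, episodes: int = 50) -> float:
    """Episodic inner loop: each update depends only on this episode's reward."""
    for _ in range(episodes):
        reward = run_episode(policy_param)
        policy_param += 0.01 * reward  # no term reaches across episodes
    return policy_param

def outer_loop_selection(candidates: list[float]) -> float:
    """Outer-loop optimisation: keep whichever candidate's trained policy
    scores best averaged over many later episodes (a cross-episode criterion)."""
    def long_run_score(p: float) -> float:
        trained = train_episodically(p)
        return sum(run_episode(trained) for _ in range(20)) / 20
    return max(candidates, key=long_run_score)

if __name__ == "__main__":
    random.seed(0)
    print("outer loop kept candidate:", outer_loop_selection([0.0, 0.5, 1.0]))
```

The same structure shows up whenever hyperparameter search or model selection is scored on long-run metrics rather than on the episode the learner actually saw.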
I’d conclude that most algorithms used today have the potential to be manipulative; but they may not be able to find the manipulative behaviour, given their limited capabilities.
I’d suspect that’s right, but I don’t think your title has the appropriate epistemic status. I think people in general should be more careful about for-all quantifiers with respect to alignment work. You use the technical term “almost every”, but you did not prove that the set of “powerful” algorithms which are not “manipulative” has measure zero. There’s also “would be” instead of “seems” (I think if you made this change, the title would be fine). I think it’s vitally important we use the correct epistemic markers; otherwise we can end up with research predicated on obvious-seeming hunches stated as fact.
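To spell out the technical reading (a sketch; the measure $\mu$ on the space $\mathcal{P}$ of “powerful” algorithms is an assumption, since the post doesn’t fix one): read literally, the title asserts

$$\mu\bigl(\{\, a \in \mathcal{P} \;:\; a \text{ is not manipulative} \,\}\bigr) = 0,$$

and that is what a proof of the for-all-style claim would have to establish.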
Not that I disagree with your suspicion here.
Rephrased the title and the intro to make this clearer.