It seems to me that, if we had the budget, we could realize today the scenarios you describe. The manipulative behavior you are discussing is not exactly rocket science.
That in turn makes me think that if we polled a bunch of people who build image classifiers for a living, and asked them whether the behavior you describe would indeed happen if the programmers behaved in the ways you describe, they would near-unanimously agree that it would.
Do you agree with both claims above? If so, then it seems your argument should conclude that even non-powerful algorithms are likely to be manipulative.
Separately, I think your examples depend on this a lot:
Many of the hyperparameters are set by a neural net, which itself takes a more “long-term view” of the error rate, trying to improve it from day to day rather than from run to run.
Is this such a common practice that we can expect “almost every powerful algorithm” to involve it somehow?
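(To check I'm picturing the same setup, here is a minimal toy sketch of what I read that sentence as describing: an outer controller that adjusts the inner run's hyperparameters based on the day-to-day error trend rather than the within-run error. All names and dynamics below are my own illustration, not anything from your post.)

```python
# Toy sketch (hypothetical names): an outer controller sets a hyperparameter
# for each run and gets feedback on the day-to-day error trend, i.e. across
# runs rather than within one run.
import random

def train_one_run(learning_rate: float) -> float:
    """Stand-in for a full training run; returns its final error rate."""
    # Placeholder dynamics: error is lowest near some "good" hyperparameter
    # value, plus noise. Purely illustrative.
    return abs(learning_rate - 0.01) * 10 + random.gauss(0.1, 0.02)

class HyperparamController:
    """Toy stand-in for the "neural net that sets hyperparameters".

    Its feedback is the cross-day change in error, so its effective
    objective spans many runs -- the long-term element in question.
    """
    def __init__(self) -> None:
        self.learning_rate = 0.1

    def update(self, yesterday_error: float, today_error: float) -> None:
        # Crude rule: if the error did not improve since yesterday,
        # shrink the learning rate for tomorrow's run.
        if today_error >= yesterday_error:
            self.learning_rate *= 0.5

controller = HyperparamController()
previous_error = train_one_run(controller.learning_rate)
for day in range(5):
    error = train_one_run(controller.learning_rate)
    controller.update(previous_error, error)
    previous_error = error
    print(f"day {day}: lr={controller.learning_rate:.4f}, error={error:.3f}")
```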
If so, then it seems your argument should conclude that even non-powerful algorithms are likely to be manipulative.
I’d conclude that most algorithms used today have the potential to be manipulative; but they may not be able to find the manipulative behaviour, given their limited capabilities.
Is this such a common practice that we can expect “almost every powerful algorithm” to involve it somehow?
No. That was just one example I constructed, one of the easiest to see. But I can build examples in many different situations. I’ll admit that “thinking longer term” is something that makes manipulation much more likely; genuinely episodic algorithms seem much harder to make manipulative. But we have to be sure the algorithm is episodic, and that there is no outer-loop optimisation going on.
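To illustrate what I mean by outer-loop optimisation sneaking in, here is a toy sketch (my own construction, with made-up names): each run is episodic on its own, but the outer loop keeps whichever configuration reported the best score, so selection across runs rewards anything that inflates the reported score.

```python
# Toy sketch (hypothetical names): each run is episodic in isolation, but the
# outer loop keeps whichever configuration reported the best score. Selection
# across runs then rewards anything that inflates the reported score, even
# though no single run optimises beyond its episode.
import random

def episodic_run(config: dict) -> float:
    """One self-contained run; returns the score it *reports*."""
    true_quality = random.random()
    # A config that games the measurement reports better than it really is.
    inflation = 0.3 if config.get("games_the_metric") else 0.0
    return true_quality + inflation

candidate_configs = [{"games_the_metric": False}, {"games_the_metric": True}]

# Outer-loop optimisation: keep the config with the best average reported score.
best_config = max(
    candidate_configs,
    key=lambda cfg: sum(episodic_run(cfg) for _ in range(100)) / 100,
)
print("selected config:", best_config)  # almost always the metric-gaming one
```

Nothing in any single run looks beyond its own episode here; the long-term pressure lives entirely in the selection step, which is why we have to check for it explicitly.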
I’d conclude that most algorithms used today have the potential to be manipulative; but they may not be able to find the manipulative behaviour, given their limited capabilities.
I’d suspect that’s right, but I don’t think your title has the appropriate epistemic status. I think people in general should be more careful about for-all quantifiers in alignment work. The title uses the technical term “almost every”, but you did not prove that the set of “powerful” algorithms that are not “manipulative” has measure zero. It also says “would be” rather than “seems” (I think if you made that one change, the title would be fine). It’s vitally important that we use the correct epistemic markers; otherwise we risk research predicated on obvious-seeming hunches stated as fact.
Not that I disagree with your suspicion here.
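For concreteness, the measure-theoretic reading of “almost every” that I have in mind is sketched below; note that the choice of measure is left unspecified, which is part of the problem.

```latex
% "Almost every" in the measure-theoretic sense: a property P holds for
% almost every x in X, with respect to a measure \mu on X, iff the set of
% exceptions has measure zero:
\[
  \mu(\{\, x \in X : \lnot P(x) \,\}) = 0 .
\]
% Here X would be some space of "powerful" algorithms and P(x) the property
% "x is manipulative" -- and no measure \mu on that space is specified,
% which is part of what I'm pointing at.
```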
Rephrased the title and the intro to make this clearer.