If I understand your examples correctly, one way a classifier can be manipulative is by learning to control its training protocol/environment. Does this mean that a fixed training protocol (without changes in the training sets or programmer interventions) would forbid this kind of manipulation?
Even if it would, this might still be a problem, since some approaches to AI Safety rely on human supervision/intervention rather than a fixed protocol.
The “swap out low-error training datasets” and “retrain if error doubles” rules can both be part of a fixed training environment. Kaj’s example is also one where the programmer had a fixed environment that they thought they understood.
The problem is the divergence between what the programmer thought the goal was, and what it really was.
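For concreteness, here is a minimal sketch (function names, thresholds, and structure are all hypothetical, not taken from the post) of how both rules can sit inside a training loop that is itself fixed. The rules never change during the run, yet the objective the classifier effectively faces is “do well under these rules”, which can diverge from the goal the programmer had in mind.

```python
# Sketch only: `train`, `evaluate`, the datasets, and the 0.5 threshold for
# "low error" are hypothetical stand-ins, not details from the post.

def fixed_training_protocol(datasets, train, evaluate, rounds=10):
    """Run a training loop whose rules never change:
    1. Swap out any dataset the current model already gets low error on.
    2. Retrain whenever the error doubles relative to the last baseline.
    """
    pool = list(datasets)                 # rotation of training datasets
    model = train(pool[0])
    baseline = evaluate(model, pool[0])   # error level the rules compare against

    for _ in range(rounds):
        data = pool[0]
        error = evaluate(model, data)

        if error < 0.5 * baseline:
            # Rule 1: this dataset is already "low error", swap it out.
            pool.append(pool.pop(0))
            continue

        if error > 2.0 * baseline:
            # Rule 2: error doubled, so retrain on the current dataset.
            model = train(data)
            baseline = evaluate(model, data)

        pool.append(pool.pop(0))          # move on to the next dataset

    return model
```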
Taking the digital evolution example, it seems that by “manipulation” you mean that the tricks used by the programmers at training time did not prevent the behavior they were supposed to prevent, which fits the intuitive notion of manipulation.
Then, we might see these examples as evidence that even classifiers, arguably the simplest of learning algorithms with the least amount of interaction with the environment, cannot be steered away from maximizing their goal by ad-hoc variations in the training protocol. Is that what your point was?
Somehow, I also see your examples (at least the first one and the digital evolution one) as indicative that classifiers are robust against manipulation by their programmers: these classifiers are trying to maximize their predictive accuracy or their fitness, while the programmers are trying to make them maximize something else. Hence, we can see the behavior of these classifiers as “fault-tolerant”, in a way.
This can be a big issue, though, when the target of prediction is not exactly what we wanted and we would like to steer the classifier toward something different.
That, and the fact that these ad-hoc variations can introduce new goals that the programmers are not aware of.