Actually, I’m not sure that sign flips are easier to deal with. A sentiment I’ve heard expressed before is that it’s much easier to tune something to produce a little more or less of it, but it’s much harder to know if you’ve got it pointed in the right direction or not. Ultimately, though, addressing false positives ends up being about these kinds of directional issues.
it’s much harder to know if you’ve got it pointed in the right direction or not
Perhaps, but the type of thing I’m describing in the post is more about preventing worse-than-death outcomes even if the sign is flipped (by designing the reward function/model in such a way that it won’t torture everyone if that happens).
This seems easier than recognising whether the sign is flipped, or than designing a system that can’t experience these sign-flip-type errors at all; I’m just unsure whether we have robust solutions for it. If it turns out that someone has figured out a reliable solution to this problem, then the only real concern is whether the AI’s developers would bother to implement it. I’d much rather risk the system going wrong and paperclipping than going wrong and turning “I Have No Mouth, and I Must Scream” into a reality.
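For concreteness, here’s a minimal sketch (my own illustration, not anything from the original post) of one way the “safe even if sign-flipped” property could be cashed out, assuming a scalar learned reward model: bound the reward below at zero, with zero reserved for neutral/null outcomes. Then the flipped objective is maximised at zero, which the agent can reach by steering toward neutral outcomes, so the failure mode looks like inaction or paperclipping rather than maximised suffering.

```python
def safe_reward(raw_reward: float) -> float:
    """Clamp a raw reward-model output to be non-negative.

    `raw_reward` is a hypothetical scalar from some learned reward model.
    Clamping guarantees min(safe_reward) == 0, so the sign-flipped
    objective -safe_reward attains its maximum (0) at neutral outcomes
    rather than at maximally bad ones.
    """
    return max(0.0, raw_reward)


# Under a sign flip, the agent maximises -safe_reward:
def flipped(raw_reward: float) -> float:
    return -safe_reward(raw_reward)


assert flipped(10.0) == -10.0   # good outcomes become dispreferred...
assert flipped(-100.0) == 0.0   # ...but a "worse than death" outcome scores
assert flipped(0.0) == 0.0      # no better than doing nothing at all.
```

The obvious cost is that the clamped reward no longer distinguishes between neutral and very bad outcomes for the unflipped agent, so this is only a sketch of the direction being gestured at, not a full solution.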