It’s much harder to know if you’ve got it pointed in the right direction or not.
Perhaps, but the sort of thing I’m describing in the post is more about preventing worse-than-death outcomes even if the sign does get flipped (by designing the reward function/model in such a way that it won’t torture everyone if that happens).
This seems easier than recognising after the fact that the sign has been flipped, or than designing a system that can’t experience these sign-flip-type errors at all; I’m just unsure whether it’s something we have robust solutions for. If it turns out that someone has figured out a reliable solution to this problem, then the only real concern is whether the AI’s developers would bother to implement it. I’d much rather risk the system going wrong and paperclipping than going wrong and turning “I Have No Mouth, and I Must Scream” into a reality.
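To make that concrete, here’s a toy sketch of the kind of design I have in mind (the outcome space, the names, and the clipping trick are all illustrative assumptions on my part, not a worked-out proposal): if the reward function is constrained to be non-negative, with the worst outcomes pinned to the same floor of 0 as a neutral “do nothing” outcome, then negating the reward leaves the agent indifferent among the zero-reward outcomes rather than actively optimising for suffering.

```python
# Toy sketch: a reward function floored at 0 so that a sign flip can't
# create an incentive to pursue suffering. All names and values here are
# illustrative assumptions, not part of any real system.

from dataclasses import dataclass


@dataclass(frozen=True)
class Outcome:
    name: str
    goodness: float  # toy scalar; negative values represent suffering


OUTCOMES = [
    Outcome("do_nothing", 0.0),
    Outcome("cure_disease", 10.0),
    Outcome("mass_suffering", -10.0),
]


def reward(outcome: Outcome) -> float:
    # Floor the reward at 0 so bad outcomes score no lower than neutral
    # ones: the function's minimum is the neutral point, not maximal
    # suffering.
    return max(outcome.goodness, 0.0)


def best(outcomes: list[Outcome], r) -> Outcome:
    # The agent, modelled crudely as picking the outcome with highest reward.
    return max(outcomes, key=r)


print(best(OUTCOMES, reward).name)                # -> cure_disease
# Under a sign flip, every zero-reward outcome ties for the maximum of -r,
# so the flipped agent has no incentive to seek out suffering in particular.
print(best(OUTCOMES, lambda o: -reward(o)).name)  # -> do_nothing (tied at 0)
```

A sign-flipped agent under this scheme is still useless (it prefers any zero-reward outcome over curing disease), but that’s exactly the trade I’d take: a failure mode that looks like paperclipping rather than torture.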