You seem to have a somewhat general argument against any solution that involves adding onto the utility function in “What if that added solution was bugged instead?”.
I might’ve failed to make my argument clear: if we designed the utility function as U = V + W (where W is the thing being added on and V refers to human values), this would only stop the sign flipping error if it was U that got flipped. If it were instead V that got flipped (so the AI optimises for U = -V + W), that’d be problematic.
I think it’s better to move on from trying to directly target the sign-flip problem and instead deal with bugs/accidents in general.
I disagree here. Obviously we’d want to mitigate both, but a robust way of preventing sign-flipping type errors specifically is absolutely crucial (if anything, so people stop worrying about it.) It’s much easier to prevent one specific bug from having an effect than trying to deal with all bugs in general.
Would you not agree that (assuming there’s an easy way of doing it), separating the system from hyperexistential risk is a good thing for psychological reasons? Even if you think it’s extremely unlikely, I’m not at all comfortable with the thought that our seed AI could screw up & design a successor that implements the opposite of our values; and I suspect there are at least some others who share that anxiety.
For the record, I think that this is also a risk worth worrying about for non-psychological reasons.
I might’ve failed to make my argument clear: if we designed the utility function as U = V + W (where W is the thing being added on and V refers to human values), this would only stop the sign flipping error if it was U that got flipped. If it were instead V that got flipped (so the AI optimises for U = -V + W), that’d be problematic.
I disagree here. Obviously we’d want to mitigate both, but a robust way of preventing sign-flipping type errors specifically is absolutely crucial (if anything, so people stop worrying about it.) It’s much easier to prevent one specific bug from having an effect than trying to deal with all bugs in general.
Would you not agree that (assuming there’s an easy way of doing it), separating the system from hyperexistential risk is a good thing for psychological reasons? Even if you think it’s extremely unlikely, I’m not at all comfortable with the thought that our seed AI could screw up & design a successor that implements the opposite of our values; and I suspect there are at least some others who share that anxiety.
For the record, I think that this is also a risk worth worrying about for non-psychological reasons.