I think the “soup of heuristics” stories (where the AI is optimizing something far causally upstream of reward, rather than something downstream of reward or robustly correlated with it) don’t lead to takeover in the same way.
Why does it not lead to takeover in the same way?
Because it’s easy to detect and correct (except that correcting it might push you into one of the other regimes).
So, far causally upstream of the human evaluator’s opinion? E.g., an AI counselor optimizing for getting to know you.