If agent N is choosing what values agent N+1 should maximize, and it picks r, and if it’s clear to humans that maximizing r is at odds with human interests (as compared to e.g. leaving humans in meaningful control of the situation)—then prima facie agent N has failed to live up to its contract of trying to do what we want.
It seems to me that the default outcome of any process like this is “r is at odds with human interests, but not in a way that humans will notice until the downstream effects of decisions are felt.” This framework does not address that problem: the mismatch is not incorporated into the model of what we want until feedback is received, and the default response to that feedback will be to execute the nearest unblocked strategy resembling the original one. (This is especially concerning because a human is not a secure system, and the downstream effects the human fails to notice can include accidental or deliberate social/basilisk-like changes to the human’s own value system. Having a human in the loop is only superficially protective.)