If agent N is choosing what values agent N+1 should maximize, and it picks r, and if it’s clear to humans that maximizing r is at odds with human interests (as compared to e.g. leaving humans in meaningful control of the situation)—then prima facie agent N has failed to live up to its contract of trying to do what we want.
It seems to me that the default outcome of any process like this is “r is at odds with human interests, but not in a way that humans will notice until the downstream effects of decisions are felt.” This framework does not address that problem: the mismatch is not incorporated into the model of what we want until feedback is received, and the default response to that feedback will be to execute the nearest unblocked strategy resembling the original one. (This is especially concerning because a human is not a secure system, and the downstream effects the human fails to notice can include accidental or deliberate social/basilisk-like changes to the human’s own value system. Having a human in the loop is only superficially protective.)