It seems like if we want to come up with a way to avoid these types of behavior, we simply must use some dependence on human values. I can’t see how to consistently separate acceptable failures from non-acceptable ones except by inferring our values.
I think people should generally be a little more careful about claiming “this requires value-laden information”. First, while a certain definition may seem to require it, there may be other ways of getting the desired behavior, perhaps by reframing the problem. Building an AI which only does small things shouldn’t require the full specification of value, even though it naively seems like you have to say “don’t do all these bad things we don’t like”!
Second, it’s always good to check: would this style of reasoning also lead me to conclude that solving the easy problem of wireheading is value-laden?
This isn’t an object-level critique of your reasoning in this post; it’s more that the standard of evidence is higher for this kind of claim.