It seems like if we want to come up with a way to avoid these types of behavior, we simply must use some dependence on human values. I can’t see how to consistently separate acceptable failures from non-acceptable ones except by inferring our values.
I think people should generally be a little more careful about claiming “this requires value-laden information”. First, while a certain definition may seem to require it, there may be other ways of getting the desired behavior, perhaps by reframing the problem. Building an AI which only does small things shouldn’t require the full specification of value, even though it naively seems like you have to say “don’t do all these bad things we don’t like”!
Second, it’s always good to check: would this style of reasoning also lead me to conclude that solving the easy problem of wireheading is value-laden?
This isn’t an object-level critique of your reasoning in this post; it’s more that the standard of evidence is higher for this kind of claim.