I (low-confidence) think that there might be a “choose two” tradeoff with respect to impact measures: large effects, no assumed ontology, and no/very limited value assumptions. I see how we might get small good effects without needing a nice pre-specified ontology or information about human values (AUP; to be discussed in the upcoming Reframing Impact posts). I also see how you might have a catastrophe-avoiding agent capable of large positive impacts, assuming an ontology but without assuming a lot about human preferences.
I know this isn’t saying why I think this yet, but I’d just like to register this now for later discussion.
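For a bit of concreteness on the AUP pointer (a simplified sketch, not the full formulation): the penalty is defined over the agent’s own attainable-utility estimates for a set of auxiliary reward functions R_i, compared against a no-op action ∅, roughly

$$R_{\text{AUP}}(s,a) \;=\; R(s,a) \;-\; \frac{\lambda}{\text{scale}} \sum_{i=1}^{n} \left| Q_{R_i}(s,a) - Q_{R_i}(s,\varnothing) \right|.$$

Since the R_i can even be randomly generated, this is the sense in which the small-effects case seems to need neither a pre-specified ontology of world states nor information about human values.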
I also see how you might have a catastrophe-avoiding agent capable of large positive impacts, assuming an ontology but without assuming a lot about human preferences.
I find this interesting but I’d be surprised if it were true :). I look forward to seeing it in the upcoming posts.
That said, I want to draw your attention to my definition of catastrophe, which I think is different from the way most people use the term. Most broadly, you might think of a catastrophe as something that we would never want to happen even once. But for inner alignment, that framing isn’t always helpful: sometimes we would rather our systems crash into the ground than intelligently optimize against us, even though we never want them to crash into the ground even once. And as a starting point, we should try to mitigate these malicious failures much more than the benign ones, even if a benign failure would have a large value-neutral impact.
A notion closely related to my definition is “unacceptable behavior” as Paul Christiano has used the term. This is how he has defined it:
In different contexts, different behavior might be acceptable and it’s up to the user of these techniques to decide. For example, a self-driving car trainer might specify: Crashing your car is tragic but acceptable. Deliberately covering up the fact that you crashed is unacceptable.
It seems that if we want a way to avoid these types of behavior, we must rely on human values to some degree. I can’t see how to consistently separate acceptable failures from unacceptable ones except by inferring our values.
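To make the self-driving-car example concrete, here is a hypothetical toy sketch (the `Outcome` fields and the `acceptable` predicate are purely illustrative, not anyone’s actual proposal). The point is that the acceptable/unacceptable boundary is drawn by a user-supplied judgment, which is where the dependence on human values seems to enter:

```python
from dataclasses import dataclass

# Purely illustrative: a toy "acceptability" predicate in the spirit of the
# self-driving-car example. The boundary it draws is a human value judgment
# supplied by the user, not something derived from the environment dynamics.

@dataclass
class Outcome:
    crashed: bool           # a benign (if tragic) failure
    concealed_crash: bool   # a malicious failure: deliberately hiding the crash

def acceptable(outcome: Outcome) -> bool:
    # Crashing is tragic but acceptable; covering up a crash is unacceptable.
    return not outcome.concealed_crash

# Both outcomes involve a crash, but only the cover-up is unacceptable.
print(acceptable(Outcome(crashed=True, concealed_crash=False)))  # True
print(acceptable(Outcome(crashed=True, concealed_crash=True)))   # False
```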
It seems that if we want a way to avoid these types of behavior, we must rely on human values to some degree. I can’t see how to consistently separate acceptable failures from unacceptable ones except by inferring our values.
I think people should generally be a little more careful about saying “this requires value-laden information”. First, while a certain definition may seem to require it, there may be other ways of getting the desired behavior, perhaps through reframing. Building an AI which only does small things should not require the full specification of value, even though it seems like you have to say “don’t do all these bad things we don’t like”!
Second, it’s always good to check: “would this style of reasoning lead me to conclude that solving the easy problem of wireheading is value-laden?”
This isn’t an object-level critique of your reasoning in this post; it’s more that the standard of evidence is higher for this kind of claim.