My understanding of the argument: if we can always come up with a conservative reporter (one that answers yes only when the true answer is yes), and this reporter can label at least one additional data point that we couldn’t label before, we can use this newly expanded dataset to pick a new reporter, feed this process back into itself ad infinitum to label more and more data, and the fixed point of iterating this process is the perfect oracle. This would imply an ability to solve arbitrary model splintering problems, which seems like it would need to either incorporate a perfect understanding of human value extrapolation baked into the process somehow, or implies that such an automated ontology identification process is not possible.
Personally, I think that it’s possible that we don’t really need a lot of understanding of human values beyond what we can extract from the easy set to figure out “when to stop.” There seems to be a pretty big set where it’s totally unambiguous what it means for the diamond to be there, and yet humans are unable to label it, and to me this seems like the main set of things we’re worried about with ELK.
My understanding of the argument: if we can always come up with a conservative reporter (one that answers yes only when the true answer is yes), and this reporter can label at least one additional data point that we couldn’t label before, we can use this newly expanded dataset to pick a new reporter, feed this process back into itself ad infinitum to label more and more data, and the fixed point of iterating this process is the perfect oracle. This would imply an ability to solve arbitrary model splintering problems, which seems like it would need to either incorporate a perfect understanding of human value extrapolation baked into the process somehow, or implies that such an automated ontology identification process is not possible.
That is a good summary of the argument.
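To make the shape of that iteration concrete, here is a toy, self-contained sketch. Everything in it is invented for illustration: the “reporter” is just a propagation rule over the current labels rather than a learned model, and nothing here is a claim about how a real conservative reporter would be found.

```python
# Toy sketch of the bootstrapping loop described above. The data points,
# labels, and propagation rule are all made up for illustration.

def iterate_to_fixed_point(labels, propagate):
    """labels: dict mapping data points to True/False (the easy set).
    propagate: given the current labels, returns any extra points it can
    label conservatively (answering yes only when the answer is yes)."""
    labels = dict(labels)
    while True:
        new = {p: v for p, v in propagate(labels).items() if p not in labels}
        if not new:              # no additional point can be labeled: fixed point
            return labels
        labels.update(new)       # the expanded dataset picks the next reporter

# Example: points are the integers 0..11, "yes" means divisible by 3, and
# the (made-up) rule only extends a label from n to n + 3.
easy_set = {0: True, 1: False, 2: False}
rule = lambda current: {n + 3: v for n, v in current.items() if n + 3 < 12}
print(iterate_to_fixed_point(easy_set, rule))   # labels all of 0..11 correctly
```

Whether the fixed point of such a loop is a perfect oracle or something smaller depends entirely on how far the conservative labeling can reach, which is exactly the question at issue below.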
Personally, I think that it’s possible that we don’t really need a lot of understanding of human values beyond what we can extract from the easy set to figure out “when to stop.”
Thanks for this question.
Consider a problem involving robotic surgery on somebody’s pet dog, and suppose that there is a plan that would transform the dog’s entire body as if it were mirror-imaged (left<->right). The dog would have a perfectly healthy body, but it would be unable to consume any food currently on Earth, because the chirality of its biology would render it incompatible with the biology of any other living thing on the planet, so it would starve. We would not want to execute this plan, and there are of course ordinary questions that, if we think to ask them, will reveal that the plan is bad (“will the dog be able to eat food from planet Earth?”). But if we only think to ask “is the dog healthy?”, then we would really like our system not to try to extrapolate the concept of “healthy” to cases where a dog’s entire body is mirror-imaged. But how would any system know that this particular simple geometric transformation (mirror-imaging) is dangerous, while other simple geometric transformations of the dog’s body, such as physically moving or rotating the dog in space, are benign? I think it would have to know what we value with respect to the dog.
To make this example sharper, let the human providing the training examples be someone alive today who has no idea that biological chirality is a thing. How surprised they would be when their seemingly-healthy pet dog died a few days after the seemingly-successful surgery! Even if they did an autopsy on the dog to find out why it died, all the organs and such would appear perfectly healthy and normal. It would be quite strange to them.
Now what if it were the diamond that was mirror-imaged inside the vault? Is the diamond still in the vault? Yeah, of course; we don’t care about a diamond being mirror-imaged. It seems to me that the reason we don’t care in the case of the diamond is that the qualities we value in a diamond are not affected by mirror-imaging.
Now you might say that atomically mirror-imaging a thing should always be classified as “going too far” in conceptual extrapolation, and that an oracle should refuse to answer questions like “is the diamond in the vault?” or “is the dog healthy?” when the thing under consideration is a mirror image of its original. I agree this kind of refusal would be good; I just think it’s hard to build this without knowing a lot about our values. Why is it that some basic geometric operations, like translation and rotation, are totally fine to extrapolate over, but mirror-imaging is not? We could hard-code that mirror-imaging is “going too far”, but then we would just run into another example of the same phenomenon.
I agree that there will be cases of ontological crisis where it’s not clear what the answer is, e.g. whether the mirrored dog counts as “healthy”. However, I feel like the thing I’m pointing at is that there is some sort of closure of any given set of training examples where, under some fairly weak assumptions, we can know that everything in this expanded set is “definitely not going too far”. As a trivial example, anything that is a direct logical consequence of anything in the training set would be part of that closure. I expect any ELK solution to look something like that. This corresponds directly to the case where the ontology identification process converges to some set smaller than the entire set of all cases.
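As a rough illustration of what such a closure could look like, here is a toy sketch. The facts, the “in” relation, and the single transitivity rule are all made up for the example; this is not a candidate ELK solution, only the shape of the fixed-point computation.

```python
# Toy sketch of closing a training set under a fixed notion of "direct
# logical consequence". The facts and the inference rule (transitivity of
# a made-up "in" relation) are purely illustrative.

def closure(facts, rules):
    """Smallest superset of `facts` closed under the given inference rules."""
    closed = set(facts)
    while True:
        derived = {c for rule in rules for c in rule(closed)} - closed
        if not derived:          # nothing new follows: the closure stops here,
            return closed        # well short of "the set of all cases"
        closed |= derived

facts = {("in", "diamond", "vault"), ("in", "vault", "room")}
transitivity = lambda F: {("in", a, c)
                          for (_, a, b) in F for (_, b2, c) in F if b == b2}
print(closure(facts, [transitivity]))
# adds ("in", "diamond", "room") and then converges
```

The hoped-for property is the one described above: the closure ends up strictly larger than the training set but strictly smaller than the set of all cases, and everything added along the way is, by construction of the rules, definitely not going too far.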