Ask dumb questions! … we encourage people to ask clarifying questions in the comments of this post (no matter how “dumb” they are)
ok… disclaimer: I know little about ML and I didn’t read all of the report.
All of our counterexamples are based on an ontology mismatch between two different Bayes nets, one used by an ML prediction model (“the predictor”) and one used by a human.
I am confused. Perhaps the above sentence is true in some tautological sense I’m missing. But in the sections of the report listing training strategies and corresponding counterexamples, I wouldn’t describe most counterexamples as based on ontology mismatch. And the above sentence seems in tension with this from the report:
We very tentatively think of ELK as having two key difficulties: ontology identification and learned optimization. … We don’t think these two difficulties can be very precisely distinguished — they are more like genres of counterexamples
So: do some of your training strategies work perfectly in the nice-ontology case, where the model has a concept of “the diamond is in the room”? If so, I missed this in the report and this feels like quite a strong result to me; if not, there are counterexamples based on things other than ontology mismatch.
I am confused. Perhaps the above sentence is true in some tautological sense I’m missing. But in the sections of the report listing training strategies and corresponding counterexamples, I wouldn’t describe most counterexamples as based on ontology mismatch.
In the report, the first volley of examples and counterexamples is not focused solely on ontology mismatch, but everything after the relevant section is.
So: do some of your training strategies work perfectly in the nice-ontology case, where the model has a concept of “the diamond is in the room”?
ARC is always considering the case where the model does “know” the right answer to whether the diamond is in the room, in the sense discussed in the self-contained problem statement appendix here.
The ontology mismatch problem does not refer to the case where the AI “just doesn’t have” some concept. We’re always assuming there is some “actually correct / true” translation between the way the AI thinks about the world and the way the human thinks about the world, one that is sufficient to answer straightforward questions about the physical world like “whether the diamond is in the room” and that is pretty easy for the AI to find.
For example, if the AI discovered some new physics and thinks in terms of hyper-strings in a four-dimensional manifold, there is still some “true” translation between that and normal objects like “tables / chairs / apples,” because the four-dimensional hyper-strings describe a universe that contains tables / chairs / apples. Furthermore, an AI smart enough to derive that complicated physics could pretty easily do that translation, if given the right incentive, just as human quantum physicists can translate between the quantum view of the world, the Newtonian view, and the folk-physics view.
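To make the “some true translation exists” point concrete, here is a minimal toy sketch (my own illustration, not from the report): even if a predictor’s native ontology is low-level, say particle positions rather than “diamonds” and “rooms,” its state still determines the answer to the human-level question, so a translation function down to that question exists. The class, room geometry, and the crude “enough carbon atoms” check below are all invented for illustration.

```python
# Hypothetical toy illustration: a low-level ontology still determines the
# human-level answer, so a "true translation" to the human's question exists.

from dataclasses import dataclass

@dataclass
class PredictorState:
    # The predictor thinks in terms of particles, not "diamonds" or "rooms".
    particle_positions: list[tuple[float, float, float]]
    particle_kinds: list[str]            # e.g. "carbon", "iron", ...

ROOM_BOUNDS = ((0.0, 10.0), (0.0, 10.0), (0.0, 3.0))   # assumed room geometry

def true_translation(state: PredictorState) -> bool:
    """Translate the predictor's ontology into the human's question
    'is the diamond in the room?' by looking for carbon inside the room."""
    def inside(p):
        return all(lo <= c <= hi for c, (lo, hi) in zip(p, ROOM_BOUNDS))
    carbon_in_room = [
        p for p, k in zip(state.particle_positions, state.particle_kinds)
        if k == "carbon" and inside(p)
    ]
    # Crude stand-in for "these carbon atoms form the diamond".
    return len(carbon_in_room) >= 100

if __name__ == "__main__":
    state = PredictorState(
        particle_positions=[(1.0, 1.0, 1.0)] * 150 + [(50.0, 50.0, 50.0)] * 10,
        particle_kinds=["carbon"] * 150 + ["iron"] * 10,
    )
    print(true_translation(state))   # True: enough carbon inside the room bounds
```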
The worry explored in this report is not that the AI won’t know how to do the translation; it’s instead a question of what our loss functions incentivize. Even if it wouldn’t be “that hard” to translate in some absolute sense, with the most obvious loss functions we can come up with it might be simpler / more natural / lower-loss to simply do inference in the human Bayes net.
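As a toy illustration of that last point (my own sketch, not the report’s formal setup; the scenario, probabilities, and function names below are all invented): we can only supervise a reporter with labels produced by a human watching the video, so a reporter that just predicts the human’s judgment, i.e. does inference in the human Bayes net, fits the training signal at least as well as one that reports what the predictor actually knows.

```python
# Toy sketch of the loss-function worry (hypothetical, not from the report).
# Training labels come from a human watching the video, so the "human
# simulator" reporter matches or beats the "direct translator" on training loss.

import random

random.seed(0)

def make_episode(fool_rate):
    """One SmartVault-style episode: the diamond is truly present or not,
    and the camera is occasionally fooled (e.g. a screen placed in front of it)."""
    diamond_present = random.random() < 0.5
    camera_fooled = random.random() < fool_rate
    video_looks_fine = diamond_present or camera_fooled
    predictor_latent = {"diamond_present": diamond_present}   # the predictor's "knowledge"
    human_label = video_looks_fine                            # what the human concludes
    return predictor_latent, video_looks_fine, human_label

def direct_translator(latent, video):
    return latent["diamond_present"]     # report what the predictor knows

def human_simulator(latent, video):
    return video                         # report what the human would conclude

def training_loss(reporter, episodes):
    """0/1 loss against the human labels -- the only supervision we have."""
    return sum(reporter(lat, vid) != lab for lat, vid, lab in episodes) / len(episodes)

episodes = [make_episode(fool_rate=0.1) for _ in range(10_000)]
print("direct translator:", training_loss(direct_translator, episodes))  # ~0.05
print("human simulator  :", training_loss(human_simulator, episodes))    # 0.0
# If fool_rate were 0 (the human is never fooled in training), the two reporters
# would tie on loss, and only inductive bias -- which strategy is "simpler" or
# more natural for the model -- would decide which one training selects.
```

In this toy version the obvious loss function actively prefers the undesired reporter whenever the human can be fooled during training, and at best is indifferent between the two; that is the sense in which doing inference in the human Bayes net can be “lower-loss.”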