I am confused. Perhaps the above sentence is true in some tautological sense I’m missing. But in the sections of the report listing training strategies and corresponding counterexamples, I wouldn’t describe most counterexamples as based on ontology mismatch.
In the report, the first volley of examples and counterexamples is not focused solely on ontology mismatch, but everything after the relevant section is.
So: do some of your training strategies work perfectly in the nice-ontology case, where the model has a concept of “the diamond is in the room”?
ARC is always considering the case where the model does “know” the right answer to whether the diamond is in the room, in the sense discussed in the self-contained problem statement appendix here.
The ontology mismatch problem is not about the case where the AI “just doesn’t have” some concept. We’re always assuming there is some “actually correct / true” translation between the way the AI thinks about the world and the way the human thinks about the world, a translation which is sufficient to answer straightforward questions about the physical world like “whether the diamond is in the room” and which is pretty easy for the AI to find.
For example, if the AI has discovered some new physics and thinks in terms of hyper-strings in a four-dimensional manifold, there is still some “true” translation between that and normal objects like “tables / chairs / apples,” because the four-dimensional hyper-strings describe a universe that contains tables / chairs / apples. Furthermore, an AI smart enough to derive that complicated physics could, given the right incentive, do that translation pretty easily, just as human quantum physicists can translate between the quantum view of the world and the Newtonian view of the world or the folk physics view of the world.
The worry explored in this report is not that the AI won’t know how to do the translation; it’s instead a question of what our loss functions incentivize. Even if it wouldn’t be “that hard” to translate in some absolute sense, with the most obvious loss functions we can come up with, it might be simpler / more natural / lower-loss to simply do inference in the human Bayes net.
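To make that last point concrete, here is a minimal toy sketch (my own illustration, not code from the report; all names and the two-example “dataset” are invented). It contrasts a reporter that reads the predictor’s own latent state with one that just answers whatever a human watching the camera would conclude, i.e. does inference in the human’s picture of the situation: on data simple enough for humans to label correctly, an obvious supervised loss cannot tell the two apart.

```python
# Toy illustration only: two reporter strategies that get identical training
# loss on human-labelable data, but diverge once the human can be fooled.
# All names (predictor_state, camera_shows_diamond, ...) are hypothetical.

def direct_translator(predictor_state):
    """Answer by reading the predictor's own latent for 'diamond in room'."""
    return predictor_state["diamond_in_room"]

def human_simulator(human_observation):
    """Answer whatever a human watching the camera would conclude."""
    return human_observation["camera_shows_diamond"]

# Training cases are simple enough that reality, the camera, and the human
# label all agree, so both reporters answer identically here.
training_data = [
    {"predictor_state": {"diamond_in_room": True},
     "human_observation": {"camera_shows_diamond": True}, "label": True},
    {"predictor_state": {"diamond_in_room": False},
     "human_observation": {"camera_shows_diamond": False}, "label": False},
]

def zero_one_loss(answers, labels):
    return sum(a != y for a, y in zip(answers, labels))

labels = [d["label"] for d in training_data]
loss_translate = zero_one_loss(
    [direct_translator(d["predictor_state"]) for d in training_data], labels)
loss_simulate = zero_one_loss(
    [human_simulator(d["human_observation"]) for d in training_data], labels)
assert loss_translate == loss_simulate == 0  # the loss can't tell them apart

# Off-distribution: the camera is spoofed, so a human would be fooled, while
# the predictor's own state reflects that the diamond is gone.
spoofed = {"predictor_state": {"diamond_in_room": False},
           "human_observation": {"camera_shows_diamond": True}}
print(direct_translator(spoofed["predictor_state"]))  # False: reports the truth
print(human_simulator(spoofed["human_observation"]))  # True: reports the human's mistaken belief
```

The loss is flat between the two strategies on the training distribution, so which one training ends up with comes down to which is simpler / more natural for the model, which is exactly the worry above.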