2) Calling the issues that arise between the agents because of model differences “terminology issues” could also work well; this may be a little like people talking past each other.
I really like this point. I think it’s parallel to the human issue where different models of the world can lead to misinterpretation of the “same” goal. “Terminology issues” would then include, for example, two different measurements of what we would assume is the same quantity. If the base optimizer is trying to set the temperature using a wall thermometer, while the mesa-optimizer uses one located on the floor, the mesa-optimizer might be misaligned because it interprets “temperature” as referring to a different fact than the base optimizer does. On the other hand, when both parties use the same metric, the class of possible mistakes does not include what we’re now calling terminology issues.
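To make the thermometer case concrete, here is a minimal sketch (the sensor names, numbers, and functions are invented purely for illustration): both objectives nominally target “20 degrees,” but they ground “temperature” in different measurements, so the mesa-optimizer can report success while the base objective is still unmet.

```python
# Hypothetical sketch: two objectives that share the words "set the
# temperature to 20" but ground "temperature" in different sensors.

def wall_sensor(world):
    return world["wall_temp"]   # what the base optimizer measures

def floor_sensor(world):
    return world["floor_temp"]  # what the mesa-optimizer measures

TARGET = 20.0

def base_loss(world):
    return abs(wall_sensor(world) - TARGET)

def mesa_loss(world):
    return abs(floor_sensor(world) - TARGET)

# A world where warm air has risen: the mesa-optimizer declares success
# while the base objective is still unsatisfied.
world = {"wall_temp": 23.0, "floor_temp": 20.0}
print(base_loss(world))  # 3.0 -> misaligned from the base optimizer's view
print(mesa_loss(world))  # 0.0 -> "aligned" from the mesa-optimizer's view
```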
I think this also points to a fundamental epistemological issue, one even broader than goal representation. Two models can disagree on representation while agreeing on all object-level claims; think of the same point expressed in different coordinate systems. Because terminology issues live at the level of representation, where they can cause mistakes, I’d suggest that agents with non-shared world models can only reliably communicate via object-level claims.
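Here is a toy illustration of representation disagreement with object-level agreement, using the coordinate-system analogy (all names and values are hypothetical): the two representations of the point look nothing alike, yet any object-level claim both agents can evaluate, such as “the point is within 2 units of the origin,” comes out the same.

```python
import math

# Hypothetical sketch: agent A uses Cartesian coordinates, agent B uses
# polar coordinates, for the very same point in the plane.
cartesian = (1.0, 1.0)                     # agent A's representation (x, y)
polar = (math.sqrt(2.0), math.pi / 4.0)    # agent B's representation (r, theta)

def within_two_units_cartesian(p):
    x, y = p
    return math.hypot(x, y) <= 2.0

def within_two_units_polar(p):
    r, _theta = p
    return r <= 2.0

print(cartesian == polar)                     # False: representations differ
print(within_two_units_cartesian(cartesian))  # True
print(within_two_units_polar(polar))          # True: object-level agreement
```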
The implication for AI alignment might be that we need AI to either fundamentally model the world the same way humans do, or to communicate with it only via object-level goals and constraints.