This isn’t quite embedded agency, but it requires the base optimizer to be “larger” than the mesa-optimizer, allowing only mesa-suboptimizers, and that is unlikely to be guaranteed in general.
Size might be easier to handle if some parts of the design are shared. For example, if the mesa-optimizer’s design were the same as the agent’s, and the agent understood itself and knew the mesa-optimizer’s design, then it seems like their being the same size wouldn’t be (as much of) an issue.
Principal optimization failures occur either when the mesa-optimizer itself falls prey to a Goodhart failure due to shared flaws in the model, or when the mesa-optimizer’s model or goals differ from the principal’s in ways that let the metrics come apart from the principal’s goals. (Abrams correctly noted in an earlier comment that this is misalignment. I’m not sure, but it seems this is principally a terminology issue.)
1) It seems like there’s a difference between the two cases. If I write a program to take the CRT, and then we both take it, and we both get the same (imperfect) score because the program solves the questions the way I solve them, that doesn’t sound like misalignment.
2) Calling the issues that arise between the agents because of model differences “terminology issues” could also work well; this may be a little like people talking past each other.
Lastly, there are mesa-transoptimizers, where the typical human kinds of principal-agent failures can occur because the mesa-optimizer has different goals. The other way this occurs is if the mesa-optimizer has access to, or builds, a different model than the base optimizer.
Some efforts require multiple parties to be on the same page. Perhaps a self-driving car that drives on the wrong side of the road could be called “unsynchronized” or “out of sync”. (If we really like using the word “alignment”, the ideal state could be called “model alignment”.)
2) Calling the issues that arise between the agents because of model differences “terminology issues” could also work well; this may be a little like people talking past each other.
I really like this point. I think it’s parallel to the human issue where different models of the world can lead to misinterpretation of the “same” goal. So “terminology issues” would include, for example, two different measurements of what we would assume is the same quantity. If the base optimizer is looking to set the temperature using a wall thermometer, while the mesa-optimizer is using one located on the floor, the mesa-optimizer might be misaligned because it interprets “temperature” as referring to a different fact than the base optimizer does. On the other hand, when the same metric is being used by both parties, the class of possible mistakes does not include what we’re now calling terminology issues.
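As a toy illustration (all names and numbers below are hypothetical, not from the discussion), here is a minimal Python sketch in which both parties agree on the goal “set the temperature to 21” but ground the word “temperature” in different sensors, so the mesa-optimizer can report success while the base optimizer’s metric is still unmet:

```python
# Toy sketch: two optimizers share the *word* "temperature" but ground it
# in different sensors, so the same goal refers to different facts.
# All names and dynamics here are hypothetical illustrations.

TARGET = 21.0  # both parties agree the goal is "temperature = 21"

def wall_thermometer(world):
    return world["wall_temp"]    # what the base optimizer means by "temperature"

def floor_thermometer(world):
    return world["floor_temp"]   # what the mesa-optimizer means by "temperature"

def mesa_step(world):
    """Mesa-optimizer nudges the heater until *its* sensor reads TARGET."""
    if floor_thermometer(world) < TARGET:
        world["heater"] += 1.0
    # crude toy dynamics: heat pools near the floor vent, so the floor
    # sensor warms up faster than the wall sensor
    world["floor_temp"] += 0.8 * world["heater"] * 0.1
    world["wall_temp"]  += 0.3 * world["heater"] * 0.1
    return world

world = {"wall_temp": 15.0, "floor_temp": 15.0, "heater": 0.0}
for _ in range(200):
    world = mesa_step(world)
    if floor_thermometer(world) >= TARGET:
        break

print("mesa-optimizer's 'temperature':", round(floor_thermometer(world), 1))  # ~21, goal "met"
print("base optimizer's 'temperature':", round(wall_thermometer(world), 1))   # still short of 21
```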
I think this also points to a fundamental epistemological issue, one even broader than goal representation. It’s possible for two models to disagree on representation but agree on all object-level claims; think of using different coordinate systems. Because terminology issues can cause mistakes, I’d suggest that agents with non-shared world models can only reliably communicate via object-level claims.
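To make the coordinate-system analogy concrete, here is a minimal hypothetical sketch: the two “models” disagree on every representation-level comparison, yet agree on object-level claims such as the distance between points:

```python
import math

# Toy sketch: two world models that disagree on representation
# (Cartesian vs. polar coordinates) but agree on object-level claims.

def cartesian(x, y):
    return ("cartesian", x, y)

def polar(r, theta):
    return ("polar", r, theta)

def to_xy(point):
    """Translate either representation into raw object-level coordinates."""
    kind, a, b = point
    if kind == "cartesian":
        return (a, b)
    return (a * math.cos(b), a * math.sin(b))

def distance(p, q):
    """An object-level claim both models can check: distance between points."""
    (x1, y1), (x2, y2) = to_xy(p), to_xy(q)
    return math.hypot(x2 - x1, y2 - y1)

# The same physical point, described by each model:
p_cart = cartesian(1.0, 1.0)
p_polar = polar(math.sqrt(2), math.pi / 4)
origin = cartesian(0.0, 0.0)

# Representation-level comparison: the models "disagree".
print(p_cart == p_polar)                        # False

# Object-level claim: the models agree.
print(math.isclose(distance(origin, p_cart),
                   distance(origin, p_polar)))  # True
```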
The implication for AI alignment might be that we need AI either to fundamentally model the world the same way humans do, or to communicate with it only via object-level goals and constraints.
It seems like there’s a difference between the two cases. If I write a program to take the CRT, and then we both take it, and we both get the same (imperfect) score because the program solves the questions the way I solve them, that doesn’t sound like misalignment.
The misalignment here is between you and the CRT, and reflects your model being misaligned with the goal / reality. That’s why I’m calling it a principal alignment failure—even though it’s the program / mesa-optimizer that fails, the alignment failure is located in the principal, you / the base optimizer.
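A toy sketch of that framing, using the classic bat-and-ball CRT question (the code and function names are hypothetical, just for illustration): the program encodes its author’s intuitive heuristic, so when it gets the wrong answer, the error originates in the principal’s model rather than in the program:

```python
# Toy sketch: a program that answers the bat-and-ball CRT question using
# the heuristic its author actually uses. If the author's heuristic is
# wrong, the program inherits that error; the failure is located in the
# principal, not in the program.

def my_intuitive_heuristic(total, difference):
    """The author's snap judgment: subtract the difference from the total."""
    return total - difference          # gives $0.10 for the classic question

def program_answer(total, difference):
    """The program just encodes the author's heuristic."""
    return my_intuitive_heuristic(total, difference)

def correct_answer(total, difference):
    """Solve total = ball + (ball + difference) for the ball's price."""
    return (total - difference) / 2    # gives $0.05

total, difference = 1.10, 1.00
print(f"my answer:        ${my_intuitive_heuristic(total, difference):.2f}")  # $0.10 (wrong)
print(f"program's answer: ${program_answer(total, difference):.2f}")          # $0.10 (same wrong answer)
print(f"correct answer:   ${correct_answer(total, difference):.2f}")          # $0.05
```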