I really like this formulation, and it greatly clarifies something I was trying to note in my recent paper on multiparty dynamics and failure modes (link here). The discussion of the likelihood of mesa-optimization due to human modeling is close to the more general points I tried to make in the discussion section of that paper. As argued here about humans, other systems are optimizers (even if they are themselves only base optimizers), and therefore any successful machine-learning system in a multiparty environment is implicitly forced to model the other parties. I called this the "opponent model," and argued that such models are dangerous because they are always approximate, reasoning directly from that point that there is great potential for misalignment. The implication of this work, though, is that they are also dangerous because they encourage machine-learning systems in multi-agent settings to become mesa-optimizers, and mesa-optimization is a critical enabler of misalignment even when the base optimizer is well aligned.
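To make the "opponent model" point concrete, here is a minimal toy sketch of my own (not code from either paper): fictitious play in rock-paper-scissors. Each agent's model of the other is just an empirical frequency estimate, so it is always approximate, and the best-response step that plans against that model is itself a small inner optimization, which is where the mesa-optimization analogy bites.

```python
# Toy illustration (my own, hypothetical): agents that keep an approximate
# empirical model of their opponent and best-respond against it.
import random
from collections import Counter

ACTIONS = ["rock", "paper", "scissors"]
BEATS = {"rock": "scissors", "paper": "rock", "scissors": "paper"}

def payoff(a, b):
    """+1 if a beats b, -1 if b beats a, 0 on a tie."""
    if a == b:
        return 0
    return 1 if BEATS[a] == b else -1

def best_response(opponent_counts):
    """Inner optimization step: maximize expected payoff against the
    *approximate* opponent model (an empirical frequency estimate)."""
    total = sum(opponent_counts.values())
    model = {a: opponent_counts[a] / total for a in ACTIONS}  # always an estimate
    return max(ACTIONS,
               key=lambda a: sum(model[b] * payoff(a, b) for b in ACTIONS))

# Counter(ACTIONS) starts every count at 1, i.e. a uniform prior.
history_a, history_b = Counter(ACTIONS), Counter(ACTIONS)
score = 0
for t in range(1000):
    a = best_response(history_b)  # A plans against its model of B
    # B mostly best-responds too, with some random exploration.
    b = random.choice(ACTIONS) if random.random() < 0.3 else best_response(history_a)
    history_a[a] += 1
    history_b[b] += 1
    score += payoff(a, b)

print(f"A's cumulative payoff against its (approximate) opponent model: {score}")
```

Nothing here hinges on the game being rock-paper-scissors; the point is only that the planning loop optimizes against an estimate rather than the true opponent, so any error in the estimate propagates directly into the chosen policy.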
I would add to the discussion here that multiparty systems can display the same dynamics, and therefore carry risks similar to those of systems that require human models. I also think, less closely connected to the current discussion but directly related to my paper, that mesa-optimizer misalignments pose new and harder-to-understand risks when such systems interact with one another.
I also strongly agree with the point that current examples are not really representative of the full risk. Unfortunately, peer reviewers strongly suggested that I include more concrete examples of failures. But as I said in the paper, "the failures seen so far are minimally disruptive. At the same time, many of the outlined failures are more problematic for agents with a higher degree of sophistication, so they should be expected not to lead to catastrophic failures given the types of fairly rudimentary agents currently being deployed. For this reason, specification gaming currently appears to be a mitigable problem, or, as Stuart Russell claimed, be thought of as 'errors in specifying the objective, period.'"
As a final aside, I think the concept of mesa-optimizers is very helpful in laying out the argument against that last claim: misalignment is more than just misspecification. I think this paper will be very helpful in showing why.
That link is broken, and I'm very interested in reading this. Is there a newer link?
ETA: I think I found it: https://arxiv.org/abs/1810.10862
The better link is the final version: https://www.mdpi.com/2504-2289/3/2/21/htm
The link in the original post broke because it included the trailing period by accident.