To extend Evan’s comment about coordination between deceptive models: Even if the deceptive models lack relevant game-theoretic mechanisms, they may still coordinate due to being (partially) aligned with each other. For example, a deceptive model X may prefer [some other random deceptive model seizing control] over [model X seizing control with probability 0.1% and the entire experiment being terminated with probability 99.9%].
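As a toy illustration of that preference, here is a minimal expected-utility sketch. Only the 0.1%/99.9% probabilities come from the example above; the utility values (1.0 for X’s own takeover, 0.4 for another deceptive model’s takeover, 0 for termination) are assumed purely for illustration.

```python
# Toy expected-utility comparison for a deceptive model X.
# All utility values below are illustrative assumptions, not claims from the thread.
u_self_takeover = 1.0    # X's value for X itself seizing control
u_other_takeover = 0.4   # X's value for some other (partially aligned) deceptive model seizing control
u_terminated = 0.0       # X's value for the experiment being terminated

p_self_success = 0.001   # 0.1% chance X's own attempt succeeds
p_terminated = 0.999     # 99.9% chance the attempt gets the experiment terminated

ev_defect = p_self_success * u_self_takeover + p_terminated * u_terminated
ev_cooperate = u_other_takeover  # defer and let the other model seize control for sure

print(f"EV of X attempting to seize control itself: {ev_defect:.3f}")
print(f"EV of deferring to another deceptive model:  {ev_cooperate:.3f}")
# With these numbers, even modest partial alignment with the other model
# makes deferring the better option for X.
```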
Why should we assume that the deceptive models will be sufficiently misaligned with each other such that this will not be an issue? Do you have intuitions about the degree of misalignment between huge neural networks that were trained to imitate demonstrators but ended up being consequentialists that care about the state of our universe?
That’s possible. But it seems like much less of a convergent instrumental goal for agents living in simulated world-models. Both options, our world optimized by us and our world optimized by a random deceptive model, probably contain very little of value as judged by agents in another random deceptive model.
So yeah, I would say some models would think like this, but I would expect the total weight on models that do to be much lower.