One thing I’d note is that AIs can learn from variables that humans can’t learn much from, so I think part of what will make this useful for alignment per se is a model of what happens if one mind has learned from a superset of the variables that another mind has learned from.
This model does allow for that. :) We can use this model whenever our two agents agree predictively about some parts of the world X; it’s totally fine if our two agents learned their models from different sources and/or make different predictions about other parts of the world.
As long as you only care about the latent variables that make X1 and X2 independent of each other, right? Asking because this feels isomorphic to classic issues relating to deception and wireheading unless one treads carefully. Though I’m not quite sure whether you intend for it to be applied in this way.
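To spell out what I mean by “the latent variables that make X1 and X2 independent”, here’s a minimal sketch in standard notation; the symbol Λ is just my label for the shared latent, not something taken from the post:

```latex
% Sketch of the independence (mediation) condition I have in mind.
% \Lambda screens off X_1 from X_2: conditioning on it makes them independent.
\[
  X_1 \perp X_2 \mid \Lambda
  \qquad\Longleftrightarrow\qquad
  P(X_1, X_2 \mid \Lambda) \;=\; P(X_1 \mid \Lambda)\, P(X_2 \mid \Lambda).
\]
```

The worry above is about whatever the other mind has learned that isn’t pinned down by that shared Λ; that’s where the deception/wireheading-shaped issues seem most likely to creep in.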