Also, I don’t agree that “see if an AIXI-like agent would be aligned” is the correct “gauntlet” to be thinking about; that kind of alignment seems doomed to me, but in any case the AI systems we actually build are not going to look anything like that.
I’m going to do my best to describe my intuitions around this.
Proposition 1: An agent will be competent at achieving goals in our environment to the extent that its world-model converges to the truth. It doesn’t have to converge all the way, but the KL-divergence from the true world-model to its world-model should come down to the same order of magnitude as the KL-divergence from the true world-model to a typical human’s world-model.
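One way to write that rough condition in symbols (my notation, not anything from the thread; μ is the true environment distribution, ν_agent and ν_human the agent’s and a typical human’s world-models):

```latex
% Proposition 1, roughly (illustrative notation): the agent's divergence from
% the truth need not vanish, only land on the same order as a typical human's.
\[
  D_{\mathrm{KL}}\!\left(\mu \,\middle\|\, \nu_{\mathrm{agent}}\right)
  \;=\; O\!\left( D_{\mathrm{KL}}\!\left(\mu \,\middle\|\, \nu_{\mathrm{human}}\right) \right)
\]
```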
Proposition 2: The world-model resulting from Bayesian reasoning with a sufficiently large model class (one containing the truth) does converge to the truth, so from Proposition 1, any competent agent’s world-model will end up as close to the Bayesian world-model as it is to the truth.
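To make the convergence claim in Proposition 2 concrete, here is a toy sanity check (a minimal sketch of my own; the Bernoulli environment, the grid of hypotheses, and all names in it are illustrative assumptions, not anything from the discussion): when the model class contains the true bias, the Bayesian posterior-predictive distribution’s KL-divergence from the truth shrinks as observations accumulate.

```python
# Toy illustration of Proposition 2: with the truth inside the model class,
# the Bayesian posterior-predictive distribution converges to the true one,
# i.e. its KL-divergence from the truth shrinks as data comes in.
import numpy as np

rng = np.random.default_rng(0)

true_p = 0.7                                  # true Bernoulli bias ("the truth")
model_class = np.linspace(0.05, 0.95, 19)     # hypothesis grid; includes 0.7
posterior = np.full(len(model_class), 1 / len(model_class))  # uniform prior

def kl_bernoulli(p, q):
    """KL-divergence from Bernoulli(p) to Bernoulli(q)."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

for t in range(1, 2001):
    x = rng.random() < true_p                 # observe one bit from the environment
    likelihood = model_class if x else (1 - model_class)
    posterior *= likelihood                   # Bayesian update
    posterior /= posterior.sum()
    if t in (10, 100, 1000, 2000):
        predictive = float(posterior @ model_class)   # posterior-predictive P(x=1)
        print(f"t={t:4d}  KL(truth || Bayes) = {kl_bernoulli(true_p, predictive):.5f}")
```

If the true bias were left out of the grid, the same loop would instead plateau at roughly the divergence to the best hypothesis in the class, which is the gap a “sufficiently large” model class is meant to close.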
Proposition 3: If the version of an “idea” that uses Bayesian reasoning (on a model class including the truth) is unsafe, then the kind of agent we actually build that is “based on that idea” will either a) not be competent, or b) roughly approximate the Bayesian version and, by default, be unsafe as well (in the absence of some interesting reason why a small confusion about future events would lead to a large deprioritization of dangerous plans).
Letting F be a failure mode that arises when an idea is implemented in the framework of a Bayesian agent with a model class including the truth, I expect, in the absence of arguments otherwise, that F will appear in any competent agent that also implements the idea in some way. However, F can be much harder to spot there, so I think one of the best ways to look for possible failure modes in the sort of AI we actually build is to analyze the idealized version it approximates: a Bayesian agent with a model class including the truth. And on the flip side, if the idea still seems to have real value when formalized in a Bayesian agent with a large model class, tractable approximations thereof seem (relatively) likely to work similarly well.
Maybe you can point me toward the steps that seem the most opaque/fishy.
Sorry in advance for how unhelpful this is going to be. I think decomposing an agent into “goals”, “world-model”, and “planning” is the wrong way to be decomposing agents. I hope to write a post about this soon.
No, that’s helpful. If it were the right way, do you think this reasoning would apply?
Edit: alternatively, if a proposal does decompose an agent into world-model/goals/planning (as IRL does), does the argument stand that we should try to analyze the behavior of a Bayesian agent with a large model class which implements the idea?
… Plausibly? Idk, it’s very hard for me to talk about the validity of intuitions in an informal, intuitive model that I don’t share. I don’t see anything obviously wrong with it.
There’s the usual issue that Bayesian reasoning doesn’t properly account for embeddedness, but I don’t think that would make much of a difference here.