The Alex Turner post you referenced does convince me that his arguments about “orbit-level power-seeking” apply to maximizers and to quantilizers/satisficers. Let me reiterate, however, that we are not suggesting quantilizers/satisficers are a good idea; rather, I firmly believe that explicit safety criteria, not plain randomization, should be used to select plans.
He also claims in that post that the “orbit-level power-seeking” issue affects all schemes based on expected utility: “There is no clever EU-based scheme which doesn’t have orbit-level power-seeking incentives.” I don’t see a formal proof of that claim, though; maybe I missed it. The rationale he gives below the claim seems to boil down to a counting argument again, which suggests to me a tacit assumption that the agent still chooses uniformly at random from some set of policies. As that is not what we suggest, I don’t see how it applies to our algorithms.
Re power-seeking in general: I believe one important class of safety criteria one should use to select from the many possible plans that can fulfill an aspiration-type goal is criteria that aim to quantify the amount of power/resources/capabilities/control potential the agent has at each time step. There are some promising metrics for this already (including “empowerment”, reachability, and Alex Turner’s AUP). We are currently investigating some versions of such measures, including ones we believe might be novel. A key challenge in doing so is, again, tractability. Counting the reachable states, for example, might be intractable, but approximating that number by a recursively computable metric based on Wasserstein distance and Gaussian approximations to latent state distributions seems tractable and might turn out to be good enough.
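To make the last point concrete, here is a minimal sketch of the kind of recursive computation I have in mind. All names and the dynamics model are hypothetical illustrations, not our actual implementation: it assumes linear-Gaussian latent dynamics purely for simplicity, propagates the mean and covariance of the latent-state distribution step by step, and uses the closed-form 2-Wasserstein distance between Gaussians as a cheap proxy for how much the set of reachable states has spread out by each time step.

```python
import numpy as np
from scipy.linalg import sqrtm

def gaussian_w2(mu1, cov1, mu2, cov2):
    """Closed-form 2-Wasserstein distance between two Gaussians
    N(mu1, cov1) and N(mu2, cov2)."""
    s2 = sqrtm(cov2)
    cross = np.real(sqrtm(s2 @ cov1 @ s2))  # sqrtm may return tiny imaginary parts
    d2 = np.sum((mu1 - mu2) ** 2) + np.trace(cov1 + cov2 - 2 * cross)
    return float(np.sqrt(max(d2, 0.0)))  # clip tiny negative round-off

def reachability_proxy(A, Q, mu0, cov0, horizon):
    """Hypothetical illustration: propagate a Gaussian approximation of the
    latent-state distribution under assumed linear-Gaussian dynamics
    x' = A x + w, w ~ N(0, Q), and return each step's W2 distance from the
    initial distribution as a recursively computable stand-in for a count
    of reachable states."""
    mu, cov = np.array(mu0, float), np.array(cov0, float)
    spreads = []
    for _ in range(horizon):
        mu = A @ mu                 # recursive moment updates: O(n^3) per step,
        cov = A @ cov @ A.T + Q     # instead of enumerating reachable states
        spreads.append(gaussian_w2(mu0, cov0, mu, cov))
    return spreads
```

The point of the sketch is only that each step's update reuses the previous step's sufficient statistics, so the cost is polynomial in the latent dimension rather than exponential in the horizon, which is what makes this family of approximations look tractable.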