I thought CDT was considered not reflectively-consistent because it fails Newcomb’s problem?
(Well, not if you define reflective stability as meaning preservation of anti-Goodhart features, but, CDT doesn’t have an anti-Goodhart feature (compared to some base thing) to preserve, so I assume you meant something a little broader?)
Like, isn’t it true that a CDT agent who anticipates being in Newcomb-like scenarios would, given the opportunity to do so, modify itself to be not a CDT agent? (Well, assuming that the Newcomb-like scenarios are of the form “at some point in the future, you will be measured, and based on this measurement, your future response will be predicted, and based on this the boxes will be filled”)
My understanding of reflective stability was “the agent would not want to modify its method of reasoning”. (E.g., a person with an addiction is not reflectively stable, because they want the thing (and pursue the thing), but would rather not want (or pursue) the thing.)
The idea being that any ideal way of reasoning should be reflectively stable.
And, I thought that what was being described in the part of this article about recovering quantilizers was not saying “here’s how you can use this framework to make quantilizers better”, so much as “quantilizers fit within this framework, and can be described within it, where the infrafunction that produces quantilizer-behavior is this one: [the (convex) set of utility functions which differ (in absolute value) from the given one by at most epsilon, in expectation under the reference policy]”
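(Concretely, if I’m reading that part right, the infrafunction would be something like

$$\Big\{\, V \;:\; \mathbb{E}_{a \sim \nu}\big[\,\lvert U(a) - V(a) \rvert\,\big] \le \epsilon \,\Big\},$$

where, as I understand infrafunctions, a policy π then gets scored by the worst case over this set, $\min_{V} \mathbb{E}_{a \sim \pi}[V(a)]$. The exact form of the bound here is my paraphrase of the description above, not a quote from the post.)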
So, I think the idea is that, a quantilizer for a given utility function U and reference distribution ν is, in effect, optimizing for an infrafunction that is/corresponds-to the set of utility functions V satisfying the bound in question,
and, therefore, any quantilizer, in a sense, is as if it “has this bound” (or, “believes this bound”)
And that therefore, any quantilizer should -
- wait.. that doesn’t seem right..? I was going to say that any quantilizer should therefore be reflectively stable, but that seems like it must be wrong? What if the reference distribution includes always taking actions to modify oneself in a way that would result in not being a quantilizer? uhhhhhh
Ah, hm, it seems to me like the way I was imagining the distribution ν, and the context in which you were considering it, are rather different. I was thinking of ν as being an accurate distribution of behaviors of some known-to-be-acceptably-safe agent, whereas it seems like you were considering it as having a much larger support, much more spread out over which behaviors it treats as comparably likely, with things being more ruled-out than ruled-in?
Good point on CDT, I forgot about this. I was using a more specific version of reflective stability.
> - wait.. that doesn’t seem right..?
Yeah this is also my reaction. Assuming that bound seems wrong.
I think there is a problem with thinking of ν as a known-to-be-acceptably-safe agent, because how can you get this information in the first place, without running that agent in the world? To construct a useful estimate of the expected value of the “safe” agent, you’d have to run it lots of times, necessarily sampling from its most dangerous behaviours.
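(I.e., the natural estimator is just Monte Carlo over the agent’s own behaviour,

$$\mathbb{E}_{a \sim \nu}[U(a)] \;\approx\; \frac{1}{N}\sum_{i=1}^{N} U(a_i), \qquad a_i \sim \nu,$$

and those samples $a_i$ are actual runs of the agent, tail behaviours included. This is just me spelling out the estimation step, not something from the post.)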
Unless there is some other non-empirical way of knowing an agent is safe?
Yeah, I was thinking of the base distribution as having large support. If you just rule-in behaviours, this seems like it’d restrict capabilities too much.
Well, I was kinda thinking of ν as being, say, a distribution of human behaviors in a certain context (as filtered through a particular user interface), though, I guess that way of doing it would only make sense within limited contexts, not general contexts where whether the agent is physically a human or something else, would matter. And in this sort of situation, well, the action of “modify yourself to no-longer be a quantilizer” would not be in the human distribution, because the actions to do that are not applicable to humans (as humans are, presumably, not quantilizers, and the types of self-modification actions that would be available are not the same). Though, “create a successor agent” could still be in the human distribution.
Of course, one doesn’t have practical access to “the true probability distribution of human behaviors in context M”, so I guess I was imagining a trained approximation to this distribution.
Hm, well, suppose that the distribution over human-like behaviors includes both making an agent which is a quantilizer and making one which isn’t, both of equal probability. Hm. I don’t see why a general quantilizer in this case would pick the quantilizer over the plain optimizer, as the utility...
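(To make this concrete, assuming the usual definition of a q-quantilizer, i.e. sample from the top q fraction of ν ranked by expected utility, and using my own toy numbers: with ν putting probability 1/2 on each of the two successor-building actions, and the plain-optimizer one having the higher expected utility, the quantilizer picks

$$\Pr[\text{plain optimizer}] = \min\!\Big(\tfrac{1/2}{q},\, 1\Big), \qquad \Pr[\text{quantilizer successor}] = 1 - \Pr[\text{plain optimizer}],$$

so for q ≤ 1/2 it builds the plain optimizer with certainty, and it never favors the quantilizer successor over it.)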
Hm...
I get the idea that the “quantilizers correspond to optimizing an infra-function of form [...]” thing is maybe dealing with a distribution over a single act?
Or.. if we have a utility function over histories up to the end of the episode, and one has a model of how the environment will be and of how one is likely to act in all future steps, then, for each of one’s potential actions in the current step, one gets an expected utility conditioned on taking that action, and this works as a utility function over actions for the current step,
and if one acts as a quantilizer over that, each step.. does that give the same behavior as an agent optimizing an infra-function defined using the condition with the L1 norm described in the post, in terms of the utility function over histories for the entire episode, and reference distributions for the whole episode?
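(Spelling the construction out in my own notation, not the post’s: with $h_{<t}$ the history so far, the environment model and the model of one’s own future actions give a step-level utility

$$U_t(a) \;=\; \mathbb{E}\big[\, U(h) \;\big|\; h_{<t},\ a_t = a \,\big],$$

and the question is whether running an ordinary quantilizer on $U_t$ against the per-step reference distribution $\nu(\cdot \mid h_{<t})$ at every step reproduces the behavior of optimizing the episode-level infrafunction built from U, the whole-episode ν, and the L1 condition.)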
argh, seems difficult...