Viewed in isolation, the optimizer responsible for training the model isn’t goal agnostic because it can be described as having preferences over external world state (the model).
This is where I am lost. In this scenario, it seems that we could describe both the model and the optimizer as either having an unconditional preference for goal agnosticism, or both as having preferences over the state of external worlds (including goal agnostic models). I don’t understand what axiom or reasoning leads to treating these two things differently.
The resulting person would still be human, and presumably not goal agnostic as a result. (…) No; “I” still have preferences over world states. They’re just being overridden.
My bad, I did not clarify that upfront, but I was specifically thinking of selecting/overriding for goal agnosticism. From your answers, I understand that you treat “goal agnostic agent” as an oxymoron, correct?
it seems that we could describe both the model and the optimizer as either having an unconditional preference for goal agnosticism, or both as having preferences over the state of external worlds (including goal agnostic models). I don’t understand what axiom or reasoning leads to treating these two things differently.
The difference is subtle but important, in the same way that an agent that “performs Bayesian inference” is different from an agent that “wants to perform Bayesian inference.”
A goal agnostic model does not want to be goal agnostic, it just is. If the model is describable as wanting to be goal agnostic, in terms of a utility function, it is not goal agnostic.
The observable difference between the two is the presence of instrumental behavior towards whatever goals it has. A model that “wants to perform Bayesian inference” might, say, maximize the amount of inference it can do, which (in the pathological limit) eats the universe.
A model that wants to be goal agnostic has fewer paths to absurd outcomes, since self-modifying to be goal agnostic is a more local process that doesn’t require eating the universe, and it may have other values that suggest eating the universe is bad; but it’s still not immediately goal agnostic.
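To make the contrast concrete, here is a minimal sketch (hypothetical function names; Python chosen only for illustration). The first function merely performs inference; the second assigns utility to external world states, which is exactly the shape that gives a planner instrumental reasons to act on the world:

```python
def bayes_update(prior, likelihood):
    """A system that *performs* Bayesian inference: it maps a prior and a
    likelihood to a posterior, and expresses no preference over anything
    outside that computation."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    total = sum(unnormalized.values())
    return {h: p / total for h, p in unnormalized.items()}

def wants_inference_utility(world_state):
    """A system that *wants* to perform inference: its utility is a function
    of external world state (how much inference has happened), so world
    states with more resources devoted to inference score higher."""
    return world_state["inferences_performed"]

# The first is inert about the world; the second ranks world states.
posterior = bayes_update({"rain": 0.5, "dry": 0.5}, {"rain": 0.9, "dry": 0.2})
```

The point is only the type signature: utility attached to a computation over one’s own outputs, versus utility attached to external world states.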
From your answers, I understand that you treat “goal agnostic agent” as an oxymoron, correct?
Agent doesn’t have a constant definition across all contexts, but it can be valid to describe a goal agnostic system as a rational agent in the VNM sense. Taking the “ideal predictor” as an example, it has a utility function that it maximizes. In the limit, it very likely represents a strong optimizing process. It just so happens that the goal agnostic utility function does not directly imply maximization with respect to external world states, and does not take instrumental actions that route through external world states (unless the system is conditioned into an agent that is not goal agnostic).
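One way to picture a utility function that is maximized without routing through external world states is a proper scoring rule: the score depends only on the emitted distribution and the realized outcome, so the maximizing policy is simply honest prediction. A toy sketch (my construction for illustration, not a claim about how real predictors are trained):

```python
import math

def expected_log_score(reported, true_dist):
    """Expected log score of reporting `reported` when outcomes are drawn
    from `true_dist`. Utility attaches to the output distribution itself,
    not to any downstream effect of emitting it."""
    return sum(p * math.log(reported[o]) for o, p in true_dist.items())

true_dist = {"a": 0.7, "b": 0.3}
honest = expected_log_score(true_dist, true_dist)
shaded = expected_log_score({"a": 0.9, "b": 0.1}, true_dist)
# Honest reporting maximizes the expected score, so "maximizing utility"
# here just means predicting well; no external world state is steered.
```

A system maximizing this is a rational maximizer in the VNM sense, yet its argmax never references the external world.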
The observable difference between the two is the presence of instrumental behavior towards whatever goals it has.
Say again? On my left, an agent that “just is goal agnostic”. On my right, an agent that “just wants to be goal agnostic”. At first both are still: the first because it is goal agnostic, the second because it wants to look as if it were goal agnostic. Then I ask something. The first responds because it doesn’t mind doing what I ask. The second responds because it wants to look as if it doesn’t mind doing what I ask. Where’s the observable difference?
If you have a model that “wants” to be goal agnostic in a way that means it behaves in a goal agnostic way in all circumstances, it is goal agnostic. It never exhibits any instrumental behavior arising from unconditional preferences over external world states.
For the purposes of goal agnosticism, that form of “wanting” is an implementation detail. The definition places no requirement on how the goal agnostic behavior is achieved.
In other words:
If the model is describable as wanting to be goal agnostic, in terms of a utility function, it is not goal agnostic.
A model that “wants” to be goal agnostic such that its behavior is goal agnostic can’t be described as “wanting” to be goal agnostic in terms of its utility function; there will be no meaningful additional terms for “being goal agnostic,” just the consequences of being goal agnostic.
As a result of how I was using the words, the fact that there is an observable difference between “being” and “wanting to be” is pretty much tautological.
A model that “wants” to be goal agnostic such that its behavior is goal agnostic can’t be described as “wanting”
Ok, I did not expect that you were using a tautology there. I’m not sure I get how to use it. Would you say a thermostat can’t be described as wanting, because it is being goal agnostic?
If you were using “wanting” the way I was using the word in the previous post, then yes, it would be wrong to describe a goal agnostic system as “wanting” something, because the way I was using that word would imply some kind of preference over external world states.
I have no particular ownership over the definition of “wanting” and people are free to use words however they’d like, but it’s at least slightly unintuitive to me to describe a system as “wanting X” in a way that is not distinct from “being X,” hence my usage.
the way I was using that word would imply some kind of preference over external world states.
It’s 100% ok to have your own set of useful definitions; I’m just trying to understand yours. In this very sense, one cannot want an external world state that is already in place, correct?
it’s at least slightly unintuitive to me to describe a system as “wanting X” in a way that is not distinct from “being X,”
Let’s say we want to maximize the number of digits of pi we explicitly know. You could say being continuously curious about the next digits is a continuous state of being, so in disguise this is actually not a goal (or at least not in the sense you’re using the word). Or you could say the state of the world does not include all the digits of pi, so wanting to know more is a valid want. Which one is a better match for your intuition?
In this very sense, one cannot want an external world state that is already in place, correct?
An agent can have unconditional preferences over world states that are already fulfilled. A maximizer doesn’t stop being a maximizer if it’s maximizing.
Let’s say we want to maximize the number of digits of pi we explicitly know.
That’s definitely a goal, and I’d describe an agent with that goal as both “wanting” in the previous sense and not goal agnostic.
Also, what about the thermostat question above?
If the thermostat is describable as goal agnostic, then I wouldn’t say it’s “wanting” by my previous definition. If the question is whether the thermostat’s full system is goal agnostic, I suppose it is, but in an uninteresting way.
(Note that if we draw the agent-box around ‘thermostat with temperature set to 72’ rather than just ‘thermostat’ alone, it is not goal agnostic anymore. Conditioning a goal agnostic agent can produce non-goal agnostic agents.)
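A toy version of the box-drawing point (hypothetical code, not from the discussion): the bare thermostat is a pure function of its conditioning input, while the thermostat bundled with a fixed setpoint consistently steers the world toward one temperature:

```python
def thermostat(setpoint, temperature):
    """The bare mechanism: a pure function of its conditioning input.
    Across all setpoints it expresses no fixed preference over the
    room's temperature."""
    if temperature < setpoint:
        return "heat"
    elif temperature > setpoint:
        return "cool"
    return "idle"

def thermostat_set_to_72(temperature):
    """Draw the box around 'thermostat + setpoint=72' and the resulting
    system consistently pushes the world toward one state: 72 degrees."""
    return thermostat(72, temperature)
```

The same mechanism pushes in opposite directions depending on its conditioning; only the conditioned system is describable as preferring a particular world state.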
An agent can have unconditional preferences over world states that are already fulfilled. A maximizer doesn’t stop being a maximizer if it’s maximizing.
Well said! In my view, if we fed a good enough maximizer the goal of learning to look as if it were a unified goal agnostic agent, then I’d expect the behavior of the resulting algorithm to handle the paradox well enough that it makes sense.
If the question is whether the thermostat’s full system is goal agnostic, I suppose it is, but in an uninteresting way.
I beg to differ. In my view our volitions look as if they arise from a set of internal thermostats that drive our behaviors, like the generalization to low n of the spontaneous fighting dance of two thermostats. If the latter can be described as goal agnostic, I don’t see why the former can’t be (hence my examples of environmental constraints that could let someone use your or my personality as a certified subprogram).
Conditioning a goal agnostic agent can produce non-goal agnostic agents.
Yes, but shall we also agree that non-goal agnostic agents can produce goal agnostic agents?
In my view, if we fed a good enough maximizer the goal of learning to look as if it were a unified goal agnostic agent, then I’d expect the behavior of the resulting algorithm to handle the paradox well enough that it makes sense.
If you successfully gave a strong maximizer the goal of maximizing a goal agnostic utility function, yes, you could then draw a box around the resulting system and correctly call it goal agnostic.
In my view our volitions look as if they arise from a set of internal thermostats that drive our behaviors, like the generalization to low n of the spontaneous fighting dance of two thermostats. If the latter can be described as goal agnostic, I don’t see why the former can’t be (hence my examples of environmental constraints that could let someone use your or my personality as a certified subprogram).
Composing multiple goal agnostic systems into a new system, or just giving a single goal agnostic system some trivial scaffolding, does not necessarily yield goal agnosticism in the new system. It won’t necessarily eliminate it, either; it depends on what the resulting system is.
Yes, but shall we also agree that non-goal agnostic agents can produce goal agnostic agents?
Yes; during training, a non-goal agnostic optimizer can produce a goal agnostic predictor.
Yes; during training, a non-goal agnostic optimizer can produce a goal agnostic predictor.
Suppose an agent is made robustly curious about what humans will next choose when free from external pressures, and nauseous if its own actions could be interpreted as experimenting on humans or on its own code. Do you agree it would be a good candidate for goal agnosticism?
Probably not? It’s tough to come up with an interpretation of those properties that wouldn’t result in the kind of unconditional preferences that break goal agnosticism.
As you might guess, it’s not obvious to me. Would you mind providing some details on these interpretations and on how you see the breakage happening?
Also, we’ve been going back and forth without feeling the need to upvote each other, which I thought was fine but turns out to be interpreted negatively.
[to clarify: it seems to be one of the criterion here: https://www.lesswrong.com/posts/hHyYph9CcYfdnoC5j/automatic-rate-limiting-on-lesswrong]
If those are your thoughts too, we can close at this point; otherwise let’s give each other some high fives. Your call, and thanks for the discussion in any case.
For example, a system that avoids experimenting on humans, even when prompted to do otherwise, is expressing a preference about humans being experimented on by itself.
Being meaningfully curious will also come along with some behavioral shift. If you tried to induce that behavior in a goal agnostic predictor through conditioning for being curious in that way, and embedded it in an agentic scaffold, it wouldn’t be terribly surprising for it to, say, set up low-interference observation mechanisms.
Not all violations of goal agnosticism necessarily yield doom, but even prosocial deviations from goal agnosticism are still deviations.
…but I thought the criterion was unconditional preference? The idea of nausea is precisely that agents can decide to act despite nausea; they’d just rather find a better solution (if their intelligence is up to the task).
I agree that curiosity, period, seems highly vulnerable. (You read Scott Alexander? He wrote a hilarious hit piece about this idea a few weeks or months ago.) But I did not say curious, period. I said curious about what humans will freely choose next.
In other words, the idea is that it should prefer not to trick humans, because if it does (for example by interfering with our perception) then it won’t know what we would have freely chosen next.
It also seems to cover security (if we’re dead it won’t know), health (if we’re incapacitated it won’t know), and prosperity (if we’re under economic constraints that impact our free will). But I’m interested to consider possible failure modes.
(« Sorry, I’d rather not do your will, for that would impact the free will of other humans. But thanks for letting me know that was your decision! You can’t imagine how good it feels when you tell me that sort of thing! »)
Notice you don’t see me campaigning for this idea, because I don’t like any solution that does not also take care of AI well-being. But when I first read « goal agnosticism », it struck me as an excellent fit for describing the behavior of an agent acting under these particular drives.
…but I thought the criterion was unconditional preference? The idea of nausea is precisely that agents can decide to act despite nausea; they’d just rather find a better solution (if their intelligence is up to the task).
Right; a preference being conditionally overwhelmed by other preferences does not make the presence of the overwhelmed preference conditional.
Or to phrase it another way, suppose I don’t like eating bread[1] (-1 utilons), but I do like eating cheese (100 utilons) and garlic (1000 utilons).
You ask me to choose between garlic bread (1000 − 1 = 999 utilons) and cheese (100 utilons); I pick the garlic bread.
The fact that I don’t like bread isn’t erased by the fact that I chose to eat garlic bread in this context.
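The arithmetic of the example, spelled out (utilon values exactly as given above):

```python
# Additive utilities from the example: bread -1, cheese 100, garlic 1000.
utility = {"bread": -1, "cheese": 100, "garlic": 1000}

def dish_utility(ingredients):
    """Total utilons of a dish under simple additive preferences."""
    return sum(utility[i] for i in ingredients)

garlic_bread = dish_utility(["garlic", "bread"])  # 1000 - 1 = 999
cheese_plate = dish_utility(["cheese"])           # 100

# The agent picks garlic bread, yet the negative term for bread is still
# present in its utility function; the preference is overwhelmed, not erased.
```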
It also seems to cover security (if we’re dead it won’t know), health (if we’re incapacitated it won’t know), and prosperity (if we’re under economic constraints that impact our free will). But I’m interested to consider possible failure modes.
This is aiming at a different problem than goal agnosticism; it’s trying to come up with an agent that is reasonably safe in other ways.
In order for these kinds of bounds (curiosity, nausea) to work, they need to incorporate enough of the human intent behind the concepts.
So perhaps there is an interpretation of those words that is helpful, but there remains the question “how do you get the AI to obey that interpretation,” and even then, that interpretation doesn’t fit the restrictive definition of goal agnosticism.
The usefulness of strong goal agnostic systems (like ideal predictors) is that, while they do not have properties like those by default, they make it possible to incrementally implement those properties.
This is aiming at a different problem than goal agnosticism; it’s trying to come up with an agent that is reasonably safe in other ways.
Well, assuming a robust implementation, I still think it obeys your criteria. But now that you mention « restrictive », my understanding is that you want this expression to refer specifically to pure predictors. Correct?
If yes, I’m not sure that’s the best choice for clarity (why not « pure predictors »?), but of course that’s your choice. If not, can you give some examples of goal agnostic agents other than pure predictors?
you mention « restrictive », my understanding is that you want this expression to refer specifically to pure predictors. Correct?
Goal agnosticism can, in principle, apply to things which are not pure predictors, and there are things which could reasonably be called predictors which are not goal agnostic.
A subset of predictors are indeed the most powerful known goal agnostic systems. I can’t currently point you toward another competitive goal agnostic system (rocks are uselessly goal agnostic), but the properties of goal agnosticism do, in concept, extend beyond predictors, so I leave the door open.
Also, by using the term “goal agnosticism” I try to highlight the value that arises directly from the goal-related properties, like statistical passivity and the lack of instrumental representational obfuscation. I could just try to use the more limited and implementation specific “ideal predictors” I’ve used before, but in order to properly specify what I mean by an “ideal” predictor, I’d need to specify goal agnosticism.
I’d be happy if you could point out a non-competitive one, or explain why my proposal above does not obey your axioms. But we seem to be getting diminishing returns sorting these questions out, so maybe it’s time to close at this point; I wish you luck. Thanks for the discussion!
[1] utterly false for the record