It seems to me there are certain similarities and correlations between thinking about decision theory (which potentially makes one or an AI one builds more powerful) and thinking about axiology (what terminal goals one should have). They’re both “ought” questions, and if you consider the intelligences that we can see or clearly reason about (individual humans, animals, Bayesian EU maximizers, narrow AIs that exist today), there seems to be a clear correlation between “ability to improve decision theory via philosophical reasoning” (as opposed to CDT-AI changing into XDT and then being stuck with that) and “tendency to choose one’s goals via philosophical reasoning”.
One explanation for this correlation (and also the only explanation I can see at the moment, besides it being accidental) is that something we call “philosophical ability” is responsible for both. Assuming that’s the case, that still leaves the question of whether philosophical ability backed up with enough computing power eventually leads to goal convergence.
One major element of philosophical reasoning seems to be a distaste for and tendency to avoid arbitrariness. It doesn’t seem implausible that for example “the ultimate philosopher” would decide that every goal except pursuit of pleasure / avoidance of pain is arbitrary (and think that pleasure/pain is not arbitrary due to philosophy-of-mind considerations).
One major element of philosophical reasoning seems to be a distaste for and tendency to avoid arbitrariness.
If an agent has goal G1 and sufficient introspective access to know its own goal, how would avoiding arbitrariness in its goals help it achieve goal G1 better than keeping goal G1 as its goal?
I suspect we humans are driven to philosophize about what our goals ought to be by our lack of introspective access, and that searching for some universal goal, rather than what we ourselves want, is a failure mode of this philosophical inquiry.
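To make the G1 question above concrete, here is a minimal sketch of the usual goal-preservation argument. The actions and payoff numbers are invented purely for illustration, and “G2” stands in for whatever less arbitrary goal the philosophizing is supposed to deliver; the point is just that an agent evaluates a candidate change to its own goal by its current goal, and a successor optimizing G2 typically scores worse by G1’s lights:

```python
# Toy version of the goal-stability argument: an agent judges possible
# self-modifications (including changes to its own goal) by how well the
# modified agent would serve its *current* goal G1. Payoffs are made up.

outcomes = {"make_widgets": {"G1": 10, "G2": 1},
            "write_poems":  {"G1": 2,  "G2": 9}}

def best_action(goal: str) -> str:
    """The action a maximizer of this goal would take."""
    return max(outcomes, key=lambda a: outcomes[a][goal])

def value_of_adopting(new_goal: str, current_goal: str = "G1") -> int:
    """How much G1-value results if the future agent optimizes new_goal."""
    return outcomes[best_action(new_goal)][current_goal]

print(value_of_adopting("G1"))  # 10: keeping G1 remains G1-optimal
print(value_of_adopting("G2"))  # 2:  switching to the "less arbitrary" G2 costs G1-value
```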
I think we don’t just lack introspective access to our goals, but can’t be said to have goals at all (in the sense of a preference ordering over some well-defined ontology, attached to some decision theory that we’re actually running). The kind of pseudo-goals we do have (behavioral tendencies and semantically unclear values expressed in natural language) don’t seem to have the motivational strength to make us think “I should keep my goal G1 instead of avoiding arbitrariness”, nor is it clear what it would even mean to “keep” such pseudo-goals as one self-improves.
What if it’s the case that evolution always or almost always produces agents like us, so the only way they can get real goals in the first place is via philosophy?
The primary point of my comment was to argue that an agent that has a goal in the strong sense would not abandon its goal as a result of philosophical consideration. Your response seems more directed at my afterthought about how our intuitions based on human experience would cause us to miss the primary point.
I think that we humans do have goals, despite not being able to consistently pursue them. I want myself and my fellow humans to continue our subjective experiences of life in enjoyable ways, without modifying what we enjoy. This includes connections to other people, novel experiences, high challenge, etc. There is, of course, much work to be done to complete this list and fully define all the high level concepts, but in the end I think there are real goals there, which I would like to be embodied in a powerful agent that actually runs a coherent decision theory. Philosophy probably has to play some role in clarifying our “pseudo-goals” as actual goals, but so does looking at our “pseudo-goals”, however arbitrary they may be.
The primary point of my comment was to argue that an agent that has a goal in the strong sense would not abandon its goal as a result of philosophical consideration.
Such an agent would also not change its decision theory as a result of philosophical consideration, which potentially limits its power.
Philosophy probably has to play some role in clarifying our “pseudo-goals” as actual goals, but so does looking at our “pseudo-goals”, however arbitrary they may be.
I wouldn’t argue against this as written, but Stuart was claiming that convergence is “very unlikely” which I think is too strong.
Such an agent would also not change its decision theory as a result of philosophical consideration, which potentially limits its power.
I don’t think that follows, or at least the agent could change its decision theory as a result of some consideration, which may or may not be “philosophical”. We already have the example that a CDT agent that learns in advance it will face Newcomb’s problem could predict it would do better if it switched to TDT.
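As a rough illustration of why the switch pays off, here is a minimal expected-value sketch using the conventional Newcomb payoffs ($1,000,000 in the opaque box, $1,000 in the transparent box) and a predictor that is correct with probability p; the numbers are the standard ones, not anything specified in this thread:

```python
# Expected payoffs in Newcomb's problem for a fixed predictor accuracy p,
# using the conventional $1,000,000 / $1,000 payoffs.

def expected_payoff(one_box: bool, p: float) -> float:
    big, small = 1_000_000, 1_000
    if one_box:
        # The opaque box is full exactly when the predictor foresaw one-boxing.
        return p * big
    # Two-boxing always gets the small box; the big box is full only when
    # the predictor wrongly expected one-boxing.
    return p * small + (1 - p) * (big + small)

for p in (0.9, 0.99):
    print(p, expected_payoff(True, p), expected_payoff(False, p))
# At p = 0.99: roughly $990,000 for one-boxing vs $11,000 for two-boxing,
# which is why a CDT agent that can self-modify before being predicted
# would want to become the kind of agent that one-boxes.
```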
I wrote earlier:
“ability to improve decision theory via philosophical reasoning” (as opposed to CDT-AI changing into XDT and then being stuck with that)
XDT (or in Eliezer’s words, “crippled and inelegant form of TDT”) is closer to TDT but still worse. For example, XDT would fail to acausally control/trade with other agents living before the time of its self-modification, or in other possible worlds.
Ah, yes, I agree that CDT would modify to XDT rather than TDT, though the fact that it self-modifies at all shows that a goal-driven agent can change decision theories when the new decision theory helps it achieve its goal. I do think it’s important to consider how a particular decision theory can decide to self-modify, and to design an agent with a decision theory that can self-modify in good ways.
Not strictly. If a strongly goal’d agent determines that a different decision theory (or any change to itself) better maximizes its goal, it would adopt that new decision theory or change.
I agree that humans are not utility-maximizers or similar goal-oriented agents—not in the sense we can’t be modeled as such things, but in the sense that these models do not compress our preferences to any great degree, which happens to be because they are greatly at odds with our underlying mechanisms for determining preference and behavior.
Also, can we even get ‘real goals’ like this? We’re treading into territory where we are potentially proposing something as silly as blue unicorns on the back side of the moon. We use goals to model other human intelligences; that is built into our language, that’s how we imagine other agents, and that’s how you predict a wolf, a cat, another ape, etc. Goals are really easy within imagination (which is not reductionist and where the true paperclip count exists as a property of the ‘world’). Outside imagination, though...
If an agent has goal G1 and sufficient introspective access to know its own goal, how would avoiding arbitrariness in its goals help it achieve goal G1 better than keeping goal G1 as its goal?
Avoiding arbitrariness is useful to epistemic rationality and therefore to instrumental rationality. If an AI has rationality as a goal it will avoid arbitrariness, whether or not that assists with G1.
Avoiding arbitrariness is useful to epistemic rationality and therefore to instrumental rationality.
Avoiding giving credence to arbitrary beliefs is useful to epistemic rationality and therefore to instrumental rationality, and therefore to goal G1. Avoiding arbitrariness in goals still does not help with achieving G1 if G1 is considered arbitrary. Be careful not to conflate different types of arbitrariness.
If an AI has rationality as a goal
Rationality is not an end goal; it is that which you do in pursuit of a goal that is more important to you than being rational.
If an agent has goal G1 and sufficient introspective access to know its own goal, how would avoiding arbitrariness in its goals help it achieve goal G1 better than keeping goal G1 as its goal?
You are making the standard MIRI assumptions that goals are unupdatable, and not including rationality (non-arbitrariness, etc.) as a terminal value. (The latter is particularly odd, as Orthogonality implies it.)
I suspect we humans are driven to philosophize about what our goals ought to be by our lack of introspective access, and that searching for some universal goal, rather than what we ourselves want, is a failure mode of this philosophical inquiry.
I suspect we want universal goals for the same reason we want universal laws.
You are making the standard MIRI assumptions that goals are unupdatable
No, I am arguing that agents with goals generally don’t want to update their goals. Neither I nor MIRI assumes goals are unupdatable; in fact, a major component of MIRI’s research is how to make sure a self-improving AI has stable goals.
and don’t include rationality (non-arbitrariness, etc.) as a terminal value. (The latter is particularly odd, as Orthogonality implies it.)
It is possible to have an agent that terminally values meta properties of its own goal system. Such agents, if they are capable of modifying their goal system, will likely self modify to some self-consistent “attractor” system. This does not mean that all agents will converge on a universal goal system. There are different ways that agents can value meta properties of their own goal system, so there are likely many attractors, and many possible agents don’t have such meta values and will not want to modify their goal systems.
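A toy way to picture the “many attractors” claim, with everything invented for illustration: treat a goal system as a vector of weights on candidate values, and a meta-preference as an update rule the agent applies to its own weights. Different starting points settle into different self-consistent fixed points, and an agent with no such meta-preference simply stays where it is:

```python
# Toy "attractors" in goal-space: a goal system is a list of weights on
# candidate values; a meta-preference is an update rule the agent applies
# to its own weights until they stop changing.

def settle(weights, update, steps=100):
    """Iterate the agent's meta-level update rule to a fixed point (or give up)."""
    for _ in range(steps):
        new = update(weights)
        if new == weights:
            return new
        weights = new
    return weights

# One possible meta-preference: "be decisive" -- push every value toward
# full endorsement (1.0) or full rejection (0.0).
decisive = lambda ws: [round(min(1.0, max(0.0, w + (0.1 if w >= 0.5 else -0.1))), 3)
                       for w in ws]

print(settle([0.8, 0.3, 0.55], decisive))       # -> [1.0, 0.0, 1.0]
print(settle([0.2, 0.9, 0.45], decisive))       # -> [0.0, 1.0, 0.0]
print(settle([0.8, 0.3, 0.55], lambda ws: ws))  # no meta-preference: unchanged
```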
It is possible to have an agent that terminally values meta properties of its own goal system. Such agents, if they are capable of modifying their goal system, will likely self modify to some self-consistent “attractor” system. This does not mean that all agents will converge on a universal goal system.
Who asserted they would? Moral agents can have all sorts of goals; they just have to respect each other’s values. If Smith wants to be an athlete and Robinson is a budding writer, that doesn’t mean one of them is immoral.
There are different ways that agents can value meta properties of their own goal system,
Ok. That would be a problem with your suggestion of valuing arbitrary meta properties of their goal system. Then let’s go back to my suggestion of valuing rationality.
so there are likely many attractors, and many possible agents don’t have such meta values and will not want to modify their goal systems.
Agents will do what they are built to do. If agents that don’t value rationality are dangerous, build ones that do.
MIRI: “We have determined that cars without brakes are dangerous. We have also determined that the best solution is to reduce the speed limit to 10 mph.”
Everyone else: “We know cars without brakes are dangerous. That’s why we build them with brakes”.
Who asserted they would? Moral agents can have all sorts of goals; they just have to respect each other’s values. If Smith wants to be an athlete and Robinson is a budding writer, that doesn’t mean one of them is immoral.
Have to, or else what? And how do we separate moral agents from agents that are not moral?
Ok. That would be a problem with your suggestion of valuing arbitrary meta properties of their goal system. Then let’s go back to my suggestion of valuing rationality.
Agents will do what they are built to do. If agents that don’t value rationality are dangerous, build ones that do.
MIRI: “We have determined that cars without brakes are dangerous. We have also determined that the best solution is to reduce the speed limit to 10 mph.”
Everyone else: “We know cars without brakes are dangerous. That’s why we build them with brakes”.
If the solution is to build agents that “value rationality,” can you explain how to do that? If it’s something so simple as to be analogous to adding brakes to a car, as opposed to, say, programming the car to be able to drive itself (let alone something much more complicated,) then it shouldn’t be so difficult to describe how to do it.
Valuing rationality for what? What would an agent which “values rationality” do?
Have to, logically. Like even numbers have to be divisible by two.
How do we recognise anything? They have behaviour and characteristics which match the definition.
For itself. I do not accept that rationality can only be instrumental, a means to an end.
The kind of thing EY, CFAR, and other promoters of rationality urge people to do.
In the same kind of very broad terms that MIRI can explain how to build Artificial Obsessive Compulsives.
The analogy was not about simplicity. Illustrative analogies are always simpler than what they are illustrating: that is where their usefulness lies.