Don’t people usually have several terminal goals at any given time?
That is not relevant to whether there are convergent terminal values[1].
To answer it anyway: people are not well-modeled as idealized terminal-goal-pursuers. More broadly, programs/minds don’t have to be idealized terminal-goal-pursuers, so humans as a particular case of programs/minds-in-general present no paradox. “What is the true terminal goal?” rests on the false premise that there must be some true terminal goal.
As for the case of idealized terminal-goal-pursuers, any two terminal goals can be combined into one, e.g. {paperclip-amount×2 + stamp-amount} or {if can create a black hole with p>20%, do so, else maximize stamps}, etc.
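If it helps, here is the same idea as a rough Python sketch (purely illustrative; the function names and numbers are made up):

```python
# Purely illustrative: each "single" goal internally mentions more than one
# thing, yet is still just one goal.

def combined_goal_a(paperclip_amount, stamp_amount):
    # "paperclip-amount × 2 + stamp-amount": paperclips count double.
    return 2 * paperclip_amount + stamp_amount

def combined_goal_b(chance_of_creating_black_hole):
    # "If I can create a black hole with p > 20%, do so, else maximize stamps"
    # written as a single decision rule.
    if chance_of_creating_black_hole > 0.20:
        return "create a black hole"
    return "make as many stamps as possible"
```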
what distinguishes a very high level instrumental goal from a terminal goal

It being instrumental to some top-level goal.

[1] Or ‘mind-independent moral facts’, as the idea has been called in philosophy: https://plato.stanford.edu/entries/moral-anti-realism/
I’m probably completely misinterpreting you, but hopefully I can exploit Cunningham’s Law to understand you better.[1] Are you saying that superintelligent AGIs won’t necessarily converge in values because even a single superintelligent agent may have multiple terminal goals? A superintelligent AGI, just like a human, may not in fact have a single most-top-level goal. (Not that I assume a superintelligent AGI is going to be human-like in its mind, or even that one AI will be like another, as per that Eliezer post you linked.)
That being said, some terminal goals may overlap in that they share certain instrumental goals?

[1] What I mean to say is I’m not intentionally being obstinate; I’m just really that dumb.
Are you saying that superintelligent AGIs won’t necessarily converge in values because even a single superintelligent agent may have multiple terminal goals?
No, I was responding to your claim that I consider unrelated. Like I wrote at the top: “That [meaning your claim that humans have multiple terminal goals] is not relevant to whether there are convergent terminal values”[1]
some terminal goals may overlap in that they share certain instrumental goals?
I don’t know what this is asking / what ‘overlap’ means. That most terminal goals share instrumental subgoals is called instrumental convergence.

[1] Which is to say: even if it were true, “humans have multiple terminal goals” would not be a step of the argument for it.
I don’t know what this is asking / what ‘overlap’ means.
I was referring to when you said this:
any two terminal goals can be combined into one, e.g. {paperclip-amount×2 + stamp-amount} or {if can create a black hole with p>20%, do so, else maximize stamps}, etc.
Which I took to mean that they overlap in some instrumental goals. That is what you meant, right? That’s what you meant by two goals combining into one: that this is possible when they both share some methods, or when there are one or more instrumental goals that are in service of each of those terminal goals? “Kill two birds with one stone”, to use the old proverb.
If not, can you be explicit (to be honest, use layman’s terms) and explain what you did mean?
Which I took to mean that they overlap in some instrumental goals. That is what you meant, right?
No. I was trying to explain that any agent that can be predicted by thinking of it as having two separate values for two different things can also be predicted by thinking of it as maximizing some single value which internally references both things.
For example: “I value paperclips. I also value stamps, but one stamp is only half as valuable as a paperclip to me” → “I have the single value of maximizing this function over the world: {paperclip-amount×2 + stamp-amount}”. (It’s fine to think of it in either way)
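A toy sketch of the “either way” point, with invented options and numbers, showing that the two descriptions pick the same action:

```python
# Toy example: three possible actions and how many paperclips/stamps each yields.
options = [
    {"name": "build a paperclip factory", "paperclips": 10, "stamps": 0},
    {"name": "build a stamp printer",     "paperclips": 0,  "stamps": 12},
    {"name": "build a bit of both",       "paperclips": 6,  "stamps": 5},
]

# Description 1: two separate values, with a paperclip worth two stamps.
def value_from_paperclips(option):
    return 2 * option["paperclips"]

def value_from_stamps(option):
    return option["stamps"]

def two_values_total(option):
    return value_from_paperclips(option) + value_from_stamps(option)

# Description 2: one single value that internally references both things.
def single_value(option):
    return 2 * option["paperclips"] + option["stamps"]

best_1 = max(options, key=two_values_total)
best_2 = max(options, key=single_value)
assert best_1 == best_2  # both descriptions predict the same choice
print(best_1["name"])    # -> "build a paperclip factory"
```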
can you be explicit (to be honest, use layman’s terms)
If you want, it would help me learn to write better, for you to list off all the words (or sentences) that confused you.
If you want, it would help me learn to write better, for you to list off all the words (or sentences) that confused you.
I would love to render any assistance I can in that regard, but my fear is this is probably more of a me-problem than a general problem with your writing.
What I really need, though, is an all-encompassing, rigid definition of a ‘terminal goal’: what is and isn’t a terminal goal. “It’s a goal which is instrumental to no other goal” just makes it feel like the definition ends wherever you want it to. And consider a system which is capable of self-modification and of changing its own goals: now the difference between an instrumental goal and a terminal goal erodes.
Nevertheless, some of your formatting was confusing to me; for example, a few replies back you wrote:
As for the case of idealized terminal-goal-pursuers, any two terminal goals can be combined into one, e.g. {paperclip-amount×2 + stamp-amount} or {if can create a black hole with p>20%, do so, else maximize stamps}, etc.
The bits “{paperclip-amount×2 + stamp-amount}” and “{if can create a black hole with p>20%, do so, else maximize stamps}” were and are very hard for me to understand. If they were presented in plain English, I’m confident I’d understand them. But using computer-code-esque variables, especially when they are not assigned values, introduces a point of failure for my understanding. Because now I need to understand your formatting and the pseudo-code correctly (and, not being a coder, I struggle to read pseudo-code at the best of times) just to understand the allusion you’re making.
Also, the phrase “idealized terminal-goal-pursuers” underspecifies what you mean by ‘idealized’. I can think of at least five possible senses you might be gesturing at:
A. A terminal-goal-pursuer whose terminal goals are “simple” enough to lend themselves as good candidates for a thought experiment, and therefore ideal from the point of view of a teacher and a student.
B. Ideal as in extremely instrumentally effective at accomplishing their goals.
C. Ideal as in they encapsulate the perfect, undiluted ‘ideal’ of a terminal goal (and therefore it is possible to have pseudo-terminal goals), i.e. a ‘Platonic ideal/essence’ as opposed to a Platonic appearance.
D. “Idealized” as in these are purely theoretical beings (at this point in time), because while humans may have terminal goals, they are not particularly good or pure examples of terminal-goal-havers? The same for any extant system we may ascribe goals to?
E. “Idealized” as a combination of A and B, specific to entities that have multiple terminal goals; which is unlikely, but, for the sake of argument, if they did have two or more terminal goals they would display certain behaviors.
I’m not sure which you mean, but I suspect it’s none of the above.
For the record, I know you absolutely don’t mean “ideal” as in “moral ideal”. Nor in an aesthetic or Freudian sense, like when a teenager “idealizes” their favourite pop star and raves on about how perfect they are in every way.
But going back to my confusion over terminal goals, and what is or isn’t one:
For example: “I value paperclips. I also value stamps, but one stamp is only half as valuable as a paperclip to me” → “I have the single value of maximizing this function over the world: {paperclip-amount×2 + stamp-amount}”. (It’s fine to think of it in either way)
I’m not sure what this statement is saying, because it describes a possibly very human attribute: that we may have two terminal goals, in that neither is subservient to, or a means of pursuing, anything else. Which is what I understand a ‘terminal’ goal to mean. The examples in the video describe very “single-minded” entities that have a single terminal goal they seek to optimize, like a stamp-collecting machine.
There’s an assumption I’m making here: that a terminal goal is “fixed”, or permanent. You see, when I said sufficiently superintelligent entities would converge on certain values, I was assuming that they would have some kind of self-modification abilities, and therefore that their terminal values would look a lot like the common convergent instrumental values of other, similarly self-adapting/improving/modifying entities.
However, if this is not a terminal goal, then what is a terminal goal? And for a system that is capable of adapting and improving itself, what would its terminal goals be? Is ‘terminal goal’ simply a term of convenience?
consider a system which is capable of self-modification and of changing its own goals: now the difference between an instrumental goal and a terminal goal erodes.
If an entity’s terminal goal is to maximize paperclips, it would not self-modify into a stamp maximizer, because that would not satisfy the goal (except in contrived cases where doing that is the choice that maximizes paperclips). A terminal goal is a case of criteria according to which actions are chosen; “self-modify to change my terminal goal” is an action.
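A toy sketch of what I mean, with invented numbers; the only point is that self-modification gets scored by the current goal, like any other action:

```python
# Invented numbers; the only point is that "self-modify" is scored by the
# *current* terminal goal, like every other action.

actions = {
    "build a paperclip factory": {"expected_paperclips": 1_000},
    "do nothing":                {"expected_paperclips": 0},
    # A future stamp maximizer would make very few paperclips, so this
    # action scores badly under the current goal.
    "self-modify into a stamp maximizer": {"expected_paperclips": 3},
}

def current_terminal_goal(outcome):
    # The criterion by which actions are chosen: more paperclips is better.
    return outcome["expected_paperclips"]

chosen = max(actions, key=lambda name: current_terminal_goal(actions[name]))
print(chosen)  # -> "build a paperclip factory"
```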
But isn’t there almost always the possibility of an entity goodharting: changing its definition of what constitutes a paperclip to one that is easier for it to maximize? How does it internally represent what a paperclip is? How broad is that definition? What power does it have over its own “thinking” (sorry to anthropomorphize) to change how it represents the things which that representation relies on?
Why is it most likely that it will have an immutable, unchanging, and unhackable terminal goal? What assumptions underpin that as more likely than fluid or even conflicting terminal goals which may cause radical self-modifications?
A terminal goal is a case of criteria according to which actions are chosen; “self-modify to change my terminal goal” is an action.

What does “a case of criteria” mean?
goodharting: changing its definition of what constitutes a paperclip to one that is easier for it to maximize
Same thing applies. “Does that fulfill the current goal-definition?” (Note this is not a single question; we can ask this about each possible goal-definition)
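A toy version of that question, again with made-up numbers: a proposal to loosen the definition of ‘paperclip’ is itself scored by the current goal-definition.

```python
# Made-up numbers. The agent scores "adopt an easier definition of paperclip"
# using the definition it has *now*, not the proposed one.

def current_goal_definition(world):
    # What the agent currently counts as a paperclip.
    return world["strict_paperclips"]

# The loosened definition is tempting because "loose_paperclips" is huge,
# but that number is not what the current goal-definition counts.
outcome_if_it_keeps_its_definition   = {"strict_paperclips": 100, "loose_paperclips": 100}
outcome_if_it_loosens_its_definition = {"strict_paperclips": 40,  "loose_paperclips": 900}

print(current_goal_definition(outcome_if_it_keeps_its_definition))    # 100
print(current_goal_definition(outcome_if_it_loosens_its_definition))  # 40 -> not chosen
```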
Why is it most likely that it [...]
This was about an abstract definition of an agent (not itself a prediction, but it does say something about a space of math that we might end up in). There are surely possible programs which would exhibit any behavior, although some look harder to program (or ‘less natural’): for example, “an entity that is a paperclip maximizer for 100 years, then suddenly switches to maximizing stamps” looks harder to program (if an embedded agent), because you’d need to find a method whereby it won’t just self-modify so that it never turns into a stamp maximizer (as turning into one would prevent it from maximizing paperclips), or won’t unleash a true paperclip maximizer and shut itself down if you rule out self-modification alone (and so on if you were to additionally rule out just that).[1]

[1] (Though, very tangentially, there is a simple way to do that.)