If you want, it would help me learn to write better, for you to list off all the words (or sentences) that confused you.
I would love to render any assistance I can in that regard, but my fear is this is probably more of a me-problem than a general problem with your writing.
What I really need, though, is an all-encompassing, rigid definition of a ‘terminal goal’ - what is and isn’t a terminal goal. “It’s a goal which is instrumental to no other goal” just makes it feel like the definition ends wherever you want it to. And consider a system which is capable of self-modification and of changing its own goals: now the difference between an instrumental goal and a terminal goal erodes.
Nevertheless, some of your formatting was confusing to me; for example, a few replies back you wrote:
As for the case of idealized terminal-goal-pursuers, any two terminal goals can be combined into one, e.g. {paperclip-amount×2 + stamp-amount} or {if can create a black hole with p>20%, do so, else maximize stamps}, etc.
The bits “{paperclip-amount×2 + stamp-amount}” and “{if can create a black hole with p>20%, do so, else maximize stamps}” were and are very hard for me to understand. If they were presented in plain English, I’m confident I’d understand them. But using computer-code-esque variables, especially when they are not assigned values, introduces a point of failure for my understanding: now I need to parse your formatting and the pseudo-code correctly (and as a non-coder, I struggle to read pseudo-code at the best of times) just to understand the allusion you’re making.
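To show where I land when I try to decode it on my own (and I may well have this wrong): I read “{paperclip-amount×2 + stamp-amount}” as a score - count the paperclips in the world, double that number, then add the number of stamps - so a world with 10 paperclips and 3 stamps scores 10×2 + 3 = 23, a world with 5 paperclips and 14 stamps scores 5×2 + 14 = 24, and the entity prefers the second. If that’s roughly what you meant, a sentence like that would have gotten me there; but I genuinely can’t tell whether my guess is right.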
Also, the phrase “idealized terminal-goal-pursuers” underspecifies what you mean by ‘idealized’. I can think of at least five possible senses you might be gesturing to:
A. a terminal-goal-pursuer whose terminal goals are “simple” enough to make them good candidates for a thought experiment—therefore ideal from the point of view of a teacher and a student.
B. ideal as in extremely instrumentally effective in accomplishing their goals,
C. ideal as in they encapsulate the perfect, undiluted ‘ideal’ of a terminal goal (and therefore it is possible to have pseudo-terminal goals) - i.e. a ‘platonic ideal/essence’ as opposed to a platonic appearance,
D. “idealized” as in these are purely theoretical beings (at this point in time) - because while humans may have terminal goals, they are not particularly good or pure examples of terminal-goal-havers? The same goes for any extant system we may ascribe goals to?
E. “idealized” as a combination of A and B that is specific to entities with multiple terminal goals - which is unlikely, but which, for the sake of argument, would display certain behaviors if they did have two or more terminal goals.
I’m not sure which you mean, but I suspect it’s none of the above.
For the record, I know you absolutely don’t mean “ideal” as in “moral ideal”, nor in an aesthetic or Freudian sense, like when a teenager “idealizes” their favourite pop-star and raves on about how perfect they are in every way.
But going back to my confusion over terminal goals, and what is or isn’t one:
For example: “I value paperclips. I also value stamps, but one stamp is only half as valuable as a paperclip to me” → “I have the single value of maximizing this function over the world: {paperclip-amount×2 + stamp-amount}”. (It’s fine to think of it in either way)
I’m not sure what this statement is saying, because it describes a possibly very human attribute—that we may have two terminal goals, in that neither is subservient to, or a means of pursuing, anything else, which is what I understand a ‘terminal’ goal to mean. The examples in the video describe very “single-minded” entities that have a single terminal goal they seek to optimize, like a stamp-collecting machine.
There are a few assumptions I’m making here, chiefly that a terminal goal is “fixed” or permanent. You see, when I said sufficiently superintelligent entities would converge on certain values, I was assuming that they would have some kind of self-modification abilities, and that therefore their terminal values would come to look a lot like the common convergent instrumental values of other, similarly self-adapting/improving/modifying entities.
However, if this is not a terminal goal, then what is a terminal goal? And for a system that is capable of adapting and improving itself, what would its terminal goals be? Is ‘terminal goal’ simply a term of convenience?
consider a system which is capable of self-modification and of changing its own goals: now the difference between an instrumental goal and a terminal goal erodes.
If an entity’s terminal goal is to maximize paperclips, it would not self-modify into a stamp maximizer, because that would not satisfy the goal (except in contrived cases where doing that is the choice that maximizes paperclips). A terminal goal is a case of criteria according to which actions are chosen; “self-modify to change my terminal goal” is an action.
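As a toy sketch of that point (everything here - the action names and the numbers - is invented for illustration, not a claim about any real system): the agent scores each available action, including “rewrite my own goal”, by how well it expects that action to do under its current criterion, and picks whichever scores best.

```python
# Toy sketch: a paperclip maximizer choosing among actions, where one
# available action is rewriting its own terminal goal.  All numbers
# are invented for illustration.

# Expected number of paperclips (judged by the CURRENT goal) that each
# action is believed to lead to.
expected_paperclips = {
    "build more paperclip factories": 1_000_000,
    "do nothing": 0,
    # Becoming a stamp maximizer means almost no future paperclips,
    # so the current criterion scores it terribly.
    "self-modify into a stamp maximizer": 5,
}

def choose_action(scores):
    """Pick the action that scores best under the current terminal goal."""
    return max(scores, key=scores.get)

print(choose_action(expected_paperclips))
# -> "build more paperclip factories"
```

The point being that “change my goal” is never evaluated by the goal it would change into; it is evaluated by the goal the agent has now.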
But isn’t there almost always a possibility of an entity goodharting by changing its definition of what constitutes a paperclip into one that is easier for it to maximize? How does it internally represent what a paperclip is? How broad is that definition? What power does it have over its own “thinking” (sorry to anthropomorphize) to change how it represents the things which that representation relies on?
Why is it most likely that it will have an immutable, unchanging, and unhackable terminal goal? What assumptions make that more likely than fluid or even conflicting terminal goals, which might cause radical self-modifications?
A terminal goal is a case of criteria according to which actions are chosen; “self-modify to change my terminal goal” is an action.
What does “a case of criteria” mean?
goodharting by changing its definition of what constitutes a paperclip into one that is easier for it to maximize
Same thing applies. “Does that fulfill the current goal-definition?” (Note this is not a single question; we can ask this about each possible goal-definition)
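Continuing the toy sketch from above (again, the names and numbers are invented): each candidate redefinition of “paperclip” is itself just another action, and it gets scored by whatever definition the agent currently has, not by the easier definition it would switch to.

```python
# Toy sketch, continuing the one above: redefining the goal is just
# another action, scored by the CURRENT definition of "paperclip".
# All numbers are invented for illustration.
actions = {
    "keep the current definition and build paperclips": 1_000_000,
    # Counting every bent wire as a "paperclip" would inflate the new
    # count, but it creates ~0 extra paperclips in the current sense,
    # so the current criterion gives it no credit.
    "redefine 'paperclip' to mean 'any bent piece of wire'": 0,
    "redefine 'paperclip' to mean 'anything at all'": 0,
}

print(max(actions, key=actions.get))
# -> "keep the current definition and build paperclips"
```

And the parenthetical is just saying that this check isn’t tied to one privileged definition: whichever goal-definition the agent currently has, the proposed swap gets judged by that one.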
Why is it most likely that it [...]
This was about an abstract definition of an agent (not itself a prediction, but it does say something about a space of math that we might end up in). There are surely possible programs which would exhibit any behavior, although some look harder to program (or ‘less natural’): for example, “an entity that is a paperclip maximizer for 100 years, then suddenly switches to maximizing stamps” looks harder to program (if an embedded agent), because you’d need to find a method where it won’t just self-modify to never turn into a stamp maximizer (as turning into one would prevent it from maximizing paperclips), or, if you rule out just self-modification, won’t unleash a true paperclip maximizer and shut itself down (and so on if you were to additionally rule out just that).[1]
[1] (though very tangentially there is a simple way to do that)