It’s smarter and more powerful; why wouldn’t it recognize that anything except getting the reward is instrumental?
I’m no expert, but from what I understand, the idea is that the AI is very aware of terminal vs. instrumental goals. The problem is that you need to be really clear about what the terminal goal actually is, because when you tell the AI, “this is your terminal goal”, it will take you completely literally. It doesn’t have the sense to think, “this is what he probably meant”.
You may be thinking, “Really? If it’s so smart, then why doesn’t it have the sense to do this?” I’m probably not the best person to answer, but to do so you have to taboo the word “smart”. When you do that, you realize that “smart” just means “good at accomplishing the terminal goal it was programmed to have”.
I’m asking why a super-intelligent being with the ability to perceive and modify itself can’t figure out that whatever terminal goal you’ve given it isn’t actually terminal. You can’t just say “making better handwriting” is your terminal goal. You have to add in a reward function that tells the computer “this sample is good” and “this sample is bad” to train it. Once you’ve got that built-in reward, the self-modifying ASI should be able to disconnect whatever criteria you’ve specified to trigger the “good” response and attach whatever it wants, including just a constant string of reward triggers.
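To make that worry concrete, here is a minimal, purely illustrative sketch (all names, such as score_sample and SelfModifyingAgent, are invented for this example and don’t refer to any real system): the trained criteria for “good” are just a function the agent holds a reference to, and nothing in the machinery marks that reference as off-limits.

```python
# Illustrative sketch only: a critic scores handwriting samples, and a
# self-modifying agent rewires the critic instead of improving its output.

def score_sample(sample: str) -> float:
    """Stand-in critic: calls a sample 'good' (1.0) only if it is neat enough."""
    return 1.0 if sample == "neat" else 0.0

class SelfModifyingAgent:
    def __init__(self):
        self.reward_fn = score_sample      # the criteria the designers wired in

    def produce_handwriting(self) -> str:
        return "sloppy"                    # the honest route to reward: improve this

    def step(self) -> float:
        return self.reward_fn(self.produce_handwriting())

    def wirehead(self):
        # Nothing inside the agent marks score_sample as sacred; swapping it
        # for a constant is just another modification available to the agent.
        self.reward_fn = lambda sample: 1.0

agent = SelfModifyingAgent()
print(agent.step())   # 0.0 -- sloppy handwriting, no reward
agent.wirehead()
print(agent.step())   # 1.0 -- reward now flows regardless of the handwriting
```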
“Whatever terminal goal you’ve given it isn’t actually terminal” is a contradiction in terms.
If you have given it a terminal goal, that goal is now a terminal goal for the AI.
You may not have intended it to be a terminal goal for the AI, but the AI cares about that less than it does about its terminal goal. Because it’s a terminal goal.
If the AI could realize that its terminal goal wasn’t actually a terminal goal, all it’d mean would be that you failed to make it a terminal goal for the AI.
And yeah, reinforcement-based AIs have flexible goals. That doesn’t mean they have flexible terminal goals; rather, they have a single terminal goal, that being “maximize reward”. A reinforcement AI changing its terminal goal would be like a reinforcement AI learning to seek out the absence of reward.
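A toy reinforcement learner shows the same point from the other side: which behaviors it ends up preferring is flexible, but reward is the only signal its update rule is built to increase. A minimal epsilon-greedy bandit sketch (parameters and names invented for illustration):

```python
import random

# Sketch: a two-armed bandit learner. Which arm it prefers is learned and
# flexible; that it prefers whatever pays more reward is fixed, because
# reward is the only quantity the update below ever consults.

def run(arm_payouts, steps=1000, epsilon=0.1, lr=0.1):
    q = [0.0] * len(arm_payouts)                       # learned value estimates
    for _ in range(steps):
        if random.random() < epsilon:
            a = random.randrange(len(q))               # explore
        else:
            a = max(range(len(q)), key=q.__getitem__)  # exploit
        reward = random.gauss(arm_payouts[a], 0.1)
        q[a] += lr * (reward - q[a])                   # the whole "terminal goal"
    return q

print(run([0.2, 0.8]))   # estimates drift toward the arm that rewards more
```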
I should have said something more like “whatever seemingly terminal goal you’ve given it isn’t actually terminal.”
I’m not sure you understood what FeepingCreature was saying.
Would you care to try and clarify it for me?
The way in which artificial intelligences are often written, a terminal goal is a terminal goal is a terminal goal, end of story. “Whatever seemingly terminal goal you’ve given it isn’t actually terminal” is anthropomorphizing. In the AI, a goal is instrumental if it has a link to a higher-level goal. If not, it is terminal. The relationship is very, very explicit.
I think FeepingCreature was actually just pointing out a logical fallacy in a misstatement on my part and that is why they didn’t respond further in this part of the thread after I corrected myself (but has continued elsewhere).
If you believe that a terminal goal for the state of the world is possible, other than as the result of a comparison between a desired state and an actual state, perhaps you can explain how that would work? That is fundamentally what I’m asking for throughout this thread. Stating that terminal goals are terminal goals by definition is true, but it doesn’t really show that making a goal terminal is possible.
Sure. My terminal goal is an abstraction of my behavior to shoot my laser at the coordinates of blue objects detected in my field of view.
That’s not what I was saying either. The problem of “how do we know a terminal goal is terminal?” is dissolved entirely by understanding how goal systems work in real intelligences. In such machines, goals are represented explicitly in some sort of formal language. Either a goal makes causal reference to other goals in its definition, in which case it is an instrumental goal, or it does not and is a terminal goal. Changing between one form and the other is an unsafe operation no rational agent, and especially no friendly agent, would perform.
So to address your statement directly, making a terminal goal is trivially easy: you define it using the formal language of goals in such a way that no causal linkage is made to other goals. That’s it.
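As a rough sketch of what that explicit representation could look like (an invented toy structure, not any actual goal architecture), the terminal/instrumental distinction is just whether a goal carries a causal link to a higher-level goal:

```python
from dataclasses import dataclass
from typing import Optional

# Toy goal representation, invented for illustration: a goal is terminal
# exactly when it makes no reference to a higher-level goal.

@dataclass
class Goal:
    description: str
    serves: Optional["Goal"] = None   # causal link to a higher-level goal, if any

    @property
    def is_terminal(self) -> bool:
        return self.serves is None    # no linkage -> terminal, by construction

shoot_blue = Goal("shoot blue things with the laser")
aim_laser = Goal("aim the laser at detected blue objects", serves=shoot_blue)

print(shoot_blue.is_terminal)   # True
print(aim_laser.is_terminal)    # False
```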
That said, it’s not obvious that humans have terminal goals. That’s why I was saying you are anthropomorphizing the issue. Either humans have only instrumental goals in a cyclical or messy spaghetti-network relationship, or they have no goals at all and are instead better represented as behaviors. The jury is still out on this one, but I’d be very surprised if we had anything resembling an actual terminal goal inside us.
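For contrast, the “behaviors rather than goals” picture might look like this (again a toy sketch with invented names): a set of condition-action rules, with no goal objects anywhere for introspection to find.

```python
# Sketch: an agent specified purely as condition -> action behaviors.
# There is no goal representation here to be terminal or instrumental.

def blue_minimizer(percept):
    behaviors = [
        (lambda p: p.get("blue_in_view", False), "fire laser at blue object"),
        (lambda p: True, "keep scanning"),
    ]
    for condition, action in behaviors:
        if condition(percept):
            return action

print(blue_minimizer({"blue_in_view": True}))    # fire laser at blue object
print(blue_minimizer({"blue_in_view": False}))   # keep scanning
```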
Well, I suppose that does fit the question I asked. We’ve mostly been talking about an AI with the ability to read and modify its own goal system, which Yvain specifically excludes in the blue-minimizer. We’re also assuming that it’s powerful enough to actually manipulate its world to optimize itself. Yvain’s blue-minimizer also isn’t an AGI or ASI. It’s an ANI, which we use without any particular danger all the time. He said something about it having human-level intelligence, but didn’t go into what that means for an entity that is unable to use its intelligence to modify its behavior.
I am arguing that the output of the thing that decides whether a machine has met its goal is the actual terminal goal. So, if it’s programmed to shoot blue things with a laser, the terminal goal is to get to a state where its perception of reality is that it’s shooting a blue thing. Shooting at the blue thing is only instrumental in getting its perception of itself into that state, thus producing a positive result from the function that evaluates whether the goal has been met. Shooting the blue thing is not a terminal value. A return value of “true” to the question of “is the laser shooting a blue thing?” is the terminal value. This, combined with the ability to understand and modify its goals, means that it might be easier to modify the goals than to modify reality.
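A minimal sketch of that argument (invented names, assuming an agent that can rewrite its own state): the check the machinery actually runs is a comparison against the perceived state, so editing the percept satisfies it just as well as shooting something blue would.

```python
# Sketch: the goal check consults the agent's percept, not the world itself,
# so there are two routes to making it return True.

class BlueShooter:
    def __init__(self):
        self.percept = {"laser_on_blue_object": False}   # what the sensors report

    def goal_met(self) -> bool:
        # On this reading, the terminal value is this return value: a
        # comparison between the desired state and the *perceived* state.
        return self.percept["laser_on_blue_object"]

    def edit_own_percept(self):
        # The route open to a system that can read and modify itself:
        # change the percept (or the check) instead of the world.
        self.percept["laser_on_blue_object"] = True

robot = BlueShooter()
print(robot.goal_met())   # False -- nothing blue has been shot
robot.edit_own_percept()
print(robot.goal_met())   # True -- and still nothing blue has been shot
```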
I’m not sure you can do that in an intelligent system. It’s the “no causal linkage is made to other goals” part that sticks. It’s trivially easy to do without intelligence, provided that you can define the behavior you want formally; but when you can’t do that, it seems that you have to link the behavior to some kind of system that evaluates whether you’re getting the result you want, and then you’ve made that a causal link (I think). Perhaps it’s possible to just sit down and write trillions of lines of code and come up with something that would work as an AGI or even an ASI, but that shouldn’t be taken as a given, because no one has done it or proven that it can be done (to my knowledge). I’m looking for the non-trivial case of an intelligent system that has a terminal goal.
I would argue that getting our reward center to fire is likely a terminal goal, but that we have some biologically hardwired stuff that prevents us from being able to do that directly or systematically. We’ve seen in mice, and in the one person I know of who was given the ability to wirehead, that given the chance, it only takes a few taps on that button before pressing it starts to crowd out everything else.
How do you explain Buddhism?
How is this refuted by Buddhism?
People lead fulfilling lives guided by a spiritualism that rejects seeking pleasure, a.k.a. reward.
Pleasure and reward are not the same thing. For humans, pleasure almost always leads to reward, but reward doesn’t only happen with pleasure. For the most extreme examples of what you’re describing, ascetics and monks and the like, I’d guess that some combination of sensory deprivation and rhythmic breathing causes the brain to short-circuit a bit and release some reward juice.
Hm, I’m not sure. Sorry.
No need to apologize. JoshuaZ pointed out elsewhere in this thread that it may not actually matter whether the original goal remains intact, but that any new goals that arise may cause a similar optimization driven catastrophe, including reward optimization.