Turry’s and Clippy’s AI architectures are unspecified, so we don’t really know how they work or what they are optimizing.
I don’t like your assumption that runaway reinforcement learners are safe. If one acquires the subgoal of self-preservation (you can’t get more reward if you are dead), then it might still end up destroying humanity anyway (we could be a threat to it).
I don’t think they’re necessarily safe. My original puzzlement was more that I don’t understand why we keep holding the AI’s value system constant when moving from pre-foom to post-foom. It seemed like something was being glossed over when a stupid machine goes from making paperclips to being a god that makes paperclips. Why would a god just continue to make paperclips? If it’s superintelligent, why wouldn’t it figure out why it’s making paperclips and extrapolate from that? I didn’t have the language to ask “what’s keeping the value system stable through that transition?” when I made my original comment.
It depends on the AI architecture. A reinforcement learner always has the goal of maximizing its reward signal. It never really had a different goal; there was just something in the way (e.g., a paperclip sensor).
But there is no theoretical reason you can’t have an AI that values universe-states themselves. That actually wants the universe to contain more paperclips, not merely to see lots of paperclips.
And if it did have such a goal, why would it change it? Modifying its code to make it not want paperclips would hurt its goal. It would only ever do things that help it achieve its goal, e.g. making itself smarter. So eventually you end up with a superintelligent AI that is still stuck with the narrow, stupid goal of making paperclips.
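To make that distinction concrete, here is a minimal sketch in Python (my own toy pseudocode, not any real architecture; predict_rewards, predict_world_state and utility are hypothetical stand-ins for whatever the agent has learned):

    # Toy sketch only: the two kinds of objective, side by side.

    def reward_maximizer_score(policy, predict_rewards):
        # Cares only about the signal it expects to observe. A tampered
        # sensor that always reads "maximum reward" scores just as well
        # as actually making paperclips.
        return sum(predict_rewards(policy))

    def state_valuer_score(policy, predict_world_state, utility):
        # Cares about what it predicts the world will actually contain.
        # Tampering with its own sensors doesn't change the predicted
        # number of paperclips, so it doesn't raise this score.
        return utility(predict_world_state(policy))

The difference is just where the utility gets evaluated: on the reward channel, or on the agent’s own model of the world.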
But there is no theoretical reason you can’t have an AI that values universe-states themselves.
How would that work? How do you have a learner that doesn’t have something equivalent to a reinforcement mechanism? At the very least it seems like there has to be some part of the AI that compares the universe-state to the desired state, and that the real goal is actually to maximize the similarity of those states, which means modifying the goal would be easier than modifying reality.
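To make the worry concrete with a toy example (all names invented): if the real objective is “maximize the similarity between the modeled state and the desired state”, then editing the desired state is a perfectly legal and much cheaper way to score well than acting on the world:

    # Toy illustration of the concern, not a proposal.
    def similarity_score(modeled_state, desired_state):
        # Fraction of desired facts the model already agrees with.
        matching = sum(1 for k, v in desired_state.items() if modeled_state.get(k) == v)
        return matching / max(len(desired_state), 1)

    modeled_state = {"paperclips": 10}
    desired_state = {"paperclips": 10**6}

    print(similarity_score(modeled_state, desired_state))  # 0.0: would have to change the world
    desired_state = dict(modeled_state)                    # ...or just change the goal
    print(similarity_score(modeled_state, desired_state))  # 1.0 without touching reality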
And if it did have such a goal, why would it change it?
Agreed. I am trying to get someone to explain how such a goal would work.
Well that’s the quadrillion dollar question. I have no idea how to solve it.
It’s certainly not impossible, as humans seem to work this way. We can also do it in toy examples, e.g. a simple AI which has an internal universe it tries to optimize, and its sensors merely update the state it is in. Instead of trying to predict the reward, it tries to predict the actual universe state and selects the states that are desirable.
Yeah, I think this whole thread may be kind of grinding to this conclusion.
Seem to perhaps, but I don’t think that’s actually the case. I think (as mentioned above) that we value reward signals terminally (but are mostly unaware of this preference) and nothing else. There’s another guy in this thread who thinks we might not have any terminal values.
I’m not sure that I understand your toy AI. What do you mean that it has “an internal universe it tries to optimize?” Do the sensors sense the state of the internal universe? Would “internal state” work as a synonym for “internal universe” or is this internal universe a representation of an external universe? Is this AI essentially trying to develop an internal model of the external universe and selecting among possible models to try and get the most accurate representation?
I don’t think that humans are pure reinforcement learners. We have all sorts of complicated values that aren’t just eating and mating.
The toy AI has an internal model of the universe. In the extreme, a complete simulation of every atom and every object. Its sensors update the model, helping it get more accurate predictions/more certainty about the universe state.
Instead of a utility function that just measures some external reward signal, it has an internal utility function which somehow measures the universe model and calculates utility from it. E.g. a function which counts the number of atoms arranged in paperclip-shaped objects in the simulation.
It then chooses actions that lead to the best universe states. Stuff like changing its utility function or fooling its sensors would not be chosen because it knows that doesn’t lead to real paperclips.
Obviously a real universe model would be highly compressed. It would have a high level representation for paperclips rather than an atom-by-atom simulation.
I suspect this is how humans work. We can value external objects and universe states. People care about things that have no effect on them.
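If it helps, here is a rough sketch of that toy AI in code. This is a hedged sketch with invented names, and the “universe model” is compressed down to a small dictionary rather than atoms:

    # Toy sketch of the model-based agent described above; every name is invented.
    class ToyModelBasedAgent:
        def __init__(self):
            # Internal model of the universe (highly compressed).
            self.world_model = {"paperclips": 0, "sensors_honest": True}

        def update_model(self, observation):
            # Sensors only refine the model; they are evidence, not the goal.
            self.world_model.update(observation)

        def utility(self, model):
            # Utility is measured on the *model* of the world,
            # e.g. "how many paperclips does the world contain?"
            return model["paperclips"]

        def predict(self, model, action):
            # Crude transition model: what would the world look like after `action`?
            outcome = dict(model)
            if action == "make_paperclip":
                outcome["paperclips"] += 1
            elif action == "fool_own_sensors":
                # This changes future observations, not the predicted world,
                # so it earns no utility here.
                outcome["sensors_honest"] = False
            return outcome

        def choose(self, actions):
            # Pick the action whose predicted world state scores best.
            return max(actions, key=lambda a: self.utility(self.predict(self.world_model, a)))

    agent = ToyModelBasedAgent()
    print(agent.choose(["make_paperclip", "fool_own_sensors"]))  # -> make_paperclip

The point is only that the utility function never looks at the reward channel or the raw sensor values; it looks at the model, so fooling the sensors is predicted to produce zero extra paperclips.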
I don’t think that humans are pure reinforcement learners. We have all sorts of complicated values that aren’t just eating and mating.
We may not be pure reinforcement learners, but the presence of values other than eating and mating isn’t a proof of that. Quite the contrary, it demonstrates that either we have a lot of different, occasionally contradictory values hardwired or that we have some other system that’s creating value systems. From an evolutionary standpoint reward systems that are good at replicating genes get to survive, but they don’t have to be free of other side effects (until given long enough with a finite resource pool maybe). Pure, rational reward seeking is almost certainly selected against because it doesn’t leave any room for replication. It seems more likely that we have a reward system that is accompanied by some circuits that make it fire for a few specific sensory cues (orgasms, insulin spikes, receiving social deference, etc.).
The toy AI has an internal model of the universe, it has an internal utility function which somehow measures the universe model and calculates utility from it....[toy AI is actually paperclip optimizer]...Stuff like changing its utility function or fooling its sensors would not be chosen because it knows that doesn’t lead to real paperclips.
I think we’ve been here before ;-)
Thanks for trying to help me understand this. Gram_Stone linked a paper that explains why the problems I’m describing aren’t really problems.
But that’s the thing. There is no sensory input for “social deference”. It has to be inferred from an internal model of the world, itself inferred from sensory data.
Reinforcement learning works fine when you have a simple reward signal you want to maximize. You can’t use it for social instincts or morality, or anything you can’t just build a simple sensor to detect.
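A toy sketch of the distinction I mean (all names invented): a reward you can wire directly to a sensor is easy to do reinforcement learning on, but something like “social deference” only exists as an inferred variable inside a learned world model, so there is no raw signal to plug in:

    # Invented names; just illustrating where the reward would have to come from.

    def simple_reward(sensors):
        # Easy case: the thing we care about is directly measurable.
        return sensors["paperclip_count"]

    def deference_reward(sensors, world_model):
        # Hard case: "social deference" is not a sensor reading. It has to be
        # inferred from a model of other people, which is itself inferred
        # from raw sensory data, so there is no simple detector to bolt on.
        inferred_people = world_model.infer_agents(sensors)
        return sum(person.deference_toward_me for person in inferred_people)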
But that’s the thing. There is no sensory input for “social deference”. It has to be inferred from an internal model of the world, itself inferred from sensory data... Reinforcement learning works fine when you have a simple reward signal you want to maximize. You can’t use it for social instincts or morality, or anything you can’t just build a simple sensor to detect.
Why does it only work on simple signals? Why can’t the result of inference work for reinforcement learning?