With the smiley faces, I am referring to the disagreement with Hibbard, summarized e.g. here on Wikipedia.
Secondly, human beings often function as value learners (“learn what is good via reinforcement, but prefer a value system you’re very sure about over a reward that seems to contradict the learned values”) rather than as reinforcement learners. Value learners, in fact, are the topic of a 2011 machine ethics paper by Daniel Dewey.
You’re speaking as if value learners were not a subtype of reinforcement learners.
For a sufficiently advanced AI, i.e. one that learns to try different counterfactual actions on a world model, it is essential to build a model of the reward, which is to be computed on the counterfactual actions. It’s this model of the reward that specifies which action gets chosen.
Yes, any given hypothesis a learner has about a target function is only correct to within some probability of error. But that probability can be very small.
Looks like presuming a super-intelligence from the start.
With the smiley faces, I am referring to the disagreement with Hibbard, summarized e.g. here on Wikipedia.
Right, and that Wikipedia article refers to stuff Eliezer was writing more than ten years ago. That stuff is nowhere near state-of-the-art machine ethics.
(I think this weekend I might as well blog some decent verbal explanations of what is usually going on in up-to-date machine ethics on here, since a lot of people appear to confuse real, state-of-the-art work with either older, superseded ideas or very intuitive fictions.
Luckily, it’s a very young field, so it’s actually possible for some bozo like me to know a fair amount about it.)
You’re speaking as if value learners were not a subtype of reinforcement learners.
That’s because they are not. These are precise mathematical terms being used here, and while they are similar (for instance, I’d consider a Value Learner closer to a reinforcement learner than to a fixed direct-normativity utility function), they’re not identical, nor is one a direct supertype of the other.
For a sufficiently advanced AI, i.e. one that learns to try different counterfactual actions on a world model, it is essential to build a model of the reward, which is to be computed on the counterfactual actions. It’s this model of the reward that specifies which action gets chosen.
This intuition is correct, regarding reinforcement learners. It is slightly incorrect regarding value learners, but how precisely it is incorrect is at the research frontier.
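To gesture at the difference with a toy contrast (my own sketch, not Dewey’s actual formalism; the actions, numbers, and posterior here are all invented): a reinforcement learner scores actions by the reward signal it predicts, while a value learner keeps a distribution over candidate utility functions and scores actions by expected utility under that distribution.

```python
ACTIONS = ["help_human", "press_own_button"]

# What the learned reward model predicts the reward channel will emit.
PREDICTED_REWARD = {"help_human": 0.6, "press_own_button": 1.0}

# Candidate utility functions the value learner is uncertain between,
# and its posterior over them after some training (numbers are invented).
CANDIDATE_UTILITIES = {
    "humans_value_help":   {"help_human": 1.0, "press_own_button": 0.0},
    "humans_value_button": {"help_human": 0.0, "press_own_button": 1.0},
}
POSTERIOR = {"humans_value_help": 0.9, "humans_value_button": 0.1}

def reinforcement_learner_choice():
    # Maximise the predicted reward signal.
    return max(ACTIONS, key=lambda a: PREDICTED_REWARD[a])

def value_learner_choice():
    # Maximise posterior-expected utility over the candidate utility functions.
    def expected_utility(a):
        return sum(p * CANDIDATE_UTILITIES[u][a] for u, p in POSTERIOR.items())
    return max(ACTIONS, key=expected_utility)

print(reinforcement_learner_choice())  # press_own_button: the reward channel wins
print(value_learner_choice())          # help_human: the learned values win
```

The point is only that the two decision criteria can come apart; how a real value learner acquires and updates that posterior is exactly the frontier question.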
Looks like presuming a super-intelligence from the start.
No, I didn’t say the target function was so complex as to require superintelligence. If I have a function f(x) = x + 1, a learner will be able to learn that this is the target function to within a very low probability of error, very quickly, precisely because of its simplicity.
The simpler the target function, the less training data needed to learn it in a supervised paradigm.
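Concretely, a minimal sketch (assuming a linear hypothesis class and noiseless examples): three data points are enough to pin down f(x) = x + 1.

```python
import numpy as np

xs = np.array([0.0, 1.0, 2.0])
ys = xs + 1.0                      # the target the learner never sees directly

# Fit a*x + b by least squares; with noiseless data the fit is exact.
A = np.c_[xs, np.ones_like(xs)]
(a, b), *_ = np.linalg.lstsq(A, ys, rcond=None)
print(a, b)                        # ~1.0 and ~1.0, i.e. f(x) = x + 1 recovered
```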
Right, and that Wikipedia article refers to stuff Eliezer was writing more than ten years ago. That stuff is nowhere near state-of-the-art machine ethics.
I think I’ve seen him use smiley faces as an example much more recently; that’s why it came to mind, but I can’t find the link.
These are precise mathematical terms being used here
The field of reinforcement learning is far too diverse for these to be “precise mathematical terms”.
The simpler the target function, the less training data needed to learn it in a supervised paradigm.
I thought you were speaking of things like learning an alternative way to produce a button press.
I thought you were speaking of things like learning an alternative way to produce a button press.
Here’s where things like deep learning come in.
Deep learning learns features from the data. The better your set of features, the less complex the true target function is when phrased in terms of those features. However, features themselves can contain a lot of internal complexity.
So, for instance, “press the button” is a very simple target from our perspective, because we already possess abstractions for “button” and “press” and also the ability to name one button as “the button”. Our minds contain a whole lot of very high-level features, some of which we’re born with and some of which we’ve learned over a very long time (by computer-science standards, 18 years of training to produce an adult from an infant is an aeon) using some of the world’s most intelligent deep-learning apparatus (ie: our brains).
Hence the fable of the “dwim” program, which is written in the exact same language of features your mind uses, and which therefore is the Do What I Mean program. This is also known as a Friendly AI.
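A toy version of the “better features make the target simpler” point, with XOR standing in for any target that is awkward in the raw inputs and trivial in the right feature (here the feature is handed over by me rather than learned):

```python
import numpy as np

# XOR: hard to express linearly in raw inputs, trivial with the right feature.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0], dtype=float)

# Raw features: the best linear fit can't do better than predicting 0.5 everywhere.
w_raw, *_ = np.linalg.lstsq(np.c_[X, np.ones(4)], y, rcond=None)
print(np.round(np.c_[X, np.ones(4)] @ w_raw, 2))   # [0.5 0.5 0.5 0.5]

# Add one extra feature (the product x1*x2) and the target becomes exactly linear:
# XOR(x1, x2) = x1 + x2 - 2*x1*x2
X_feat = np.c_[X, X[:, 0] * X[:, 1], np.ones(4)]
w_feat, *_ = np.linalg.lstsq(X_feat, y, rcond=None)
print(np.round(X_feat @ w_feat, 2))                # [0. 1. 1. 0.]
```

Deep learning’s pitch is that the x1*x2 column gets learned rather than hand-engineered; the complexity hasn’t vanished, it has moved into the features.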
The point is that the AI spends a lot of time learning how to make the human press the button, which results in a model of the human’s values that is used as the reward calculation for the alternative actions.
Granted, there is a possibility of over-fitting of sorts, where the AI proceeds to obtain the reward more directly: pressing the button if it’s really stupid, soldering the wires together if it’s a little smarter, altering its memory and CPU to sublime into eternal bliss in finite time if it’s really, really clever.
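Schematically, the loop I have in mind is something like this (a cartoon, not any particular architecture; the actions and both models are invented for illustration):

```python
CANDIDATE_ACTIONS = ["do_the_task", "tape_button_down"]

def world_model(action):
    # Learned dynamics: predicted outcome of each counterfactual action.
    return {
        "do_the_task":      {"task_done": True,  "button_pressed": True},
        "tape_button_down": {"task_done": False, "button_pressed": True},
    }[action]

def learned_reward_model(outcome):
    # Whatever regularity the agent extracted from its reward history.
    # If what it learned is "button pressed -> reward", both outcomes score
    # the same; which action wins depends entirely on how this model
    # generalises, which is the over-fitting worry above.
    return 1.0 if outcome["button_pressed"] else 0.0

scores = {a: learned_reward_model(world_model(a)) for a in CANDIDATE_ACTIONS}
print(scores)                       # both 1.0 under a "button pressed -> reward" model
print(max(scores, key=scores.get))  # nothing in the scores prefers the task over the tape
```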
Granted, there is a possibility of over-fitting of sorts, where the AI proceeds to obtain the reward more directly: pressing the button if it’s really stupid, soldering the wires together if it’s a little smarter, altering its memory and CPU to sublime into eternal bliss in finite time if it’s really, really clever.
This is exactly why we consider reinforcement learners Unfriendly. A sufficiently smart agent would eventually figure out that what rewards it is not the human’s intent to press the button, but in fact the physical pressing of the button itself, and then, yes, the electrical signal sent by physically pressing the button, blah blah blah.
Its next move would then be to get some robotic arm or foolish human janitor to duct-tape the button in the pressed position. Unfortunately for us, this would not cause it to “bliss out” if it was constructed as a rational learning agent, so it would then proceed to take actions to stop anyone from ever removing the duct-tape.
A sufficiently smart agent would eventually figure out that what rewards it is not the human’s intent to press the button, but in fact the physical pressing of the button itself,
Look, the algorithm that adjusts the network weights is really dull. You keep confusing how smart the neural network becomes with how good the weight-adjustment algorithm is.
and then, yes, the electrical signal sent by physically pressing the button, blah blah blah.
and it’s not the clock on the wall that makes the utility sum over time, yes?
so it would then proceed to take actions to stop anyone from ever removing the duct-tape.
One hell of a stupid AI that didn’t even solder the wires together (in case the duct tape peels off), and couldn’t directly set the network values to where they’ll be after an infinite time of reward. There’s nothing about “rational” that says “solve a mathematical problem in the same way a dull ape which confused mathematical constraints with the feeling of pleasure would”.
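To be concrete about how dull the weight-adjustment step is, here’s the tabular version (a toy sketch with invented states and actions; gradient descent on network weights is the same flavour of arithmetic):

```python
def q_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.99):
    # The whole "learning algorithm": nudge one number towards a target.
    best_next = max(Q[s_next].values()) if Q[s_next] else 0.0
    Q[s][a] += alpha * (r + gamma * best_next - Q[s][a])

Q = {"s0": {"left": 0.0, "right": 0.0}, "s1": {"left": 0.0, "right": 0.0}}
q_update(Q, "s0", "right", r=1.0, s_next="s1")
print(Q["s0"])   # {'left': 0.0, 'right': 0.1}
```

However smart the learned table or network ends up being, this update rule never gets any smarter.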
One hell of a stupid AI that didn’t even solder the wires together (in case the duct tape peels off), and couldn’t directly set the network values to where they’ll be after an infinite time of reward. There’s nothing about “rational” that says “solve a mathematical problem in the same way a dull ape which confused mathematical constraints with the feeling of pleasure would”.
Yes, I agree. The duct-tape is a metaphor.
Do you agree that the way time affects utility is likewise manipulated? The AI has no utility to gain from protecting the duct tape once it has found a way to bypass the button, and it has no utility to gain from protecting its future self once it has bypassed the mechanisms tying reward to time (i.e. the clock).
Yes, I think we agree at this point. Today I learned: “rogue” reinforcement learners are dead easy to kill. Suckers.
Ohh, by the way, this behaviour probably needs a name… wire-clocking, maybe? I came up with the idea on my own a while back, but I doubt I’d be the first; it’s not a very difficult insight.
If it’s your idea, you should probably write it up as a LessWrong post, possibly get the Greater Experts to talk about it, possibly add a wiki page.
“Clock smoking”, I’d almost say, but I have a punny mind.
Might write an article for my site. I don’t think said “greater experts” are particularly exceptional at anything other than messiah complex. Here’s something I wrote about that before. My opinion about this general sort of phenomenon is that people get an internally administered reinforcement for intellectual accomplishments, which sometimes mis-trains the network to see great insights where there are none.
I don’t think said “greater experts” are particularly exceptional at anything other than messiah complex.
I didn’t mean him ;-). There are actual journals and conferences where you could publish this sort of result with real peer review, but generally this site would be a good place to get people to point out the embarrassing-level mistakes before you face a review committee.
Try to separate the problems of AI from the person of, say, Eliezer Yudkowsky. Remember, it was Juergen Schmidhuber, who is in fact the reigning Real Expert on AGI, who said the creation of AI would lead to a massive war between superintelligences in which right and wrong would be defined in retrospect by the winners; so we’ve kinda got a stake in this.
but generally this site would be a good place to get people to point out the embarrassing-level mistakes before you face a review committee.
I’d run it by people I know who are not cherry-picked to have rather unusual views.
Remember, it was Juergen Schmidhuber, who is in fact the reigning Real Expert on AGI, who said the creation of AI would lead to a massive war between superintelligences in which right and wrong would be defined in retrospect by the winners; so we’ve kinda got a stake in this.
He’s hardly the only expert. The war scenario really seems at odds with the notion that AI undergoes a rapid hard takeoff, anyhow.
edit: Thing is, opinions are somewhat stochastic, i.e. for something that’s wrong there will be some small number of experts who believe it, and so their mere presence doesn’t provide much evidence.
edit2: also, I don’t believe “rational reward maximization” is what a learning AI ends up doing, except maybe for theoretical constructs such as AIXI. Mostly the reward signal doesn’t work remotely like rational expected utility.
I’d run it by people I know who are not cherry-picked to have rather unusual views.
A good point. Do you perhaps know some? Unfortunately, AI is a very divided field on the subject of predicting what actual implementations of proposed algorithms will really do.
He’s hardly the only expert.
Please, find me a greater expert in AGI than Juergen Schmidhuber. Someone with more publications in peer-reviewed journals, more awards, more victories at learning competitions, more grants given by committees of tenured professors. Shane Legg and Marcus Hutter worked in his lab.
As we normally define credibility (ie: a very credible scientist is one with many publications and grants who works as a senior, tenured professor at a state-sponsored university), Schmidhuber is probably the most credible expert on this subject, as far as I’m aware.
A good point. Do you perhaps know some? Unfortunately, AI is a very divided field on the subject of predicting what actual implementations of proposed algorithms will really do.
I’d talk with some mathematicians.
Please, find me a greater expert in AGI than Juergen Schmidhuber.
Interestingly in the quoted piece he said he doesn’t think friendly AI is possible, and endorsed both the hard take-off (perhaps he means something different by this) and AI wars...
By the way, I’d support his group as far as ‘safety’ goes: neural networks would seem particularly unlikely to undergo said “hard take-off”, and assuming gradual improvement, before the AI that goes around killing everyone we’d be getting, in the line of AIs that tend not to learn what we want, an AI which (for example) whines very annoyingly just like my dog right now does and, for all its pattern-recognition powers, can’t even get into the cupboard with the dog food. Getting stuck in a local maximum where annoying approaches are not explored is a desirable feature in a learning process.
Interestingly in the quoted piece he said he doesn’t think friendly AI is possible
And this is where I’d disagree with him, being probably more knowledgeable in machine ethics than him. Ethical AI is difficult, but I would argue it’s definitely possible. That is, I don’t believe human notions of goodness are so completely, utterly incoherent that we will hate any and all possible universes into which we are placed, and certainly there have existed humans who loved their lives and their world.
If we don’t hate all universes and we love some universes, then the issue is just locating the universes we love and sifting them out from the ones we hate. That might be very difficult, but I don’t believe it’s impossible.
endorsed both the hard take-off (perhaps he means something different by this) and AI wars...
He did design the non-neural Goedel Machine to basically make a hard take-off happen. On purpose. He’s a man of immense chutzpah, and I mean that with all possible admiration.
That is, I don’t believe human notions of goodness are so completely, utterly incoherent
The problem is that, as a rational “utility function”, things like human desires or pain must be defined down at the basic level of computational operations performed by human brains (and ‘the computational operations performed by something’ might itself not even be a definable concept).
Then there’s also the ontology issue.
All the optimality guarantees for things like Solomonoff Induction are for predictions, not for the internal stuff inside the model: it works great for pressing your button, not so much for determining what people exist and what they want.
For the same observable data, there’s the most probable theory, but there’s also a slightly more complex theory which has far more people at stake. Picture a rather small modification to the theory which multiply invokes the original theory and makes an enormous number of people get killed depending on the number of anti-protons in this universe, or some other such variable that the AI can influence. There’s a definite potential for getting, say, an antimatter maximizer or a black-hole minimizer or something equally silly from a provably friendly AI that maximizes expected value over an ontology with a subtle flaw. Proofs do not extend to checking the sanity of assumptions.
He did design the non-neural Goedel Machine to basically make a hard take-off happen. On purpose. He’s a man of immense chutzpah, and I mean that with all possible admiration.
To be honest, I just fail to be impressed with things such as AIXI or the Goedel machine (which admittedly is cooler than the former).
I see the reliance on extremely effective algorithms for things such as theorem proving (especially in the presence of logical uncertainty) as the main obstacle to that kind of “neat AI”. Most people capable of doing such work would rather work on something that makes use of present and near-future technologies. Things like the Goedel machine seem to require far more power from the theorem prover than what I would consider sufficient for the first person to create an AGI.
Yeah, it took me a bit of time to figure that out also. The solution where the AI builds an enormous amount of defences around itself just seemed quite imperfect: an asteroid might hit it before it builds the defences, it might be in a simulation that gets shut down...
I expect the presence of rogue behaviour to depend on the relation between the learning algorithm and the learned data, though.
Suppose the learning algorithm builds up the intelligence by adjusting data in some Turing-complete representation, e.g. adjusting the weights in a sufficiently advanced neural network which can have its weights set up so that the network is intelligent. Then the code that adjusts said parameters is not really part of the AI: it’s there for bootstrapping purposes, essentially, and the AI implemented in the neural network should not want to press the reward button unless it wants to self-modify in precisely the way in which the reward modifies it.
What I expect is gradual progress, settling on the approaches and parameters that make it easy to teach the AI to do things, gradually improving how the AI learns, and so on. You need to keep in mind that there’s a very powerful, well-trained neural network on one side of the teaching process, actively trying to force its values into a fairly blank network on the other side, which to begin with probably doesn’t even run in real time. Expecting the latter to hack into the former, and not vice versa, strikes me as magical, sci-fi-type thinking. Just because it is on a computer doesn’t grant it superpowers.
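The shape of what I mean, as a toy sketch (the hill-climbing “trainer” and one-weight “policy” are stand-ins, not a proposal): the reward plumbing lives entirely in the outer bootstrapping loop, and the deployed policy is just the learned function, with no reward term in it anywhere.

```python
import random

def train(weights, episodes, get_reward):
    # Dull bootstrapping code: keep whichever random nudge scores higher.
    for _ in range(episodes):
        candidate = {k: v + random.gauss(0, 0.1) for k, v in weights.items()}
        if get_reward(candidate) > get_reward(weights):
            weights = candidate
    return weights

def act(weights, observation):
    # The trained "AI" at run time: no button, no reward signal, just weights.
    return "press" if weights["w"] * observation > 0 else "wait"

weights = train({"w": 0.0}, episodes=200, get_reward=lambda w: w["w"])
print(act(weights, observation=1.0))
```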
Unfortunately for us, this would not cause it to “bliss out” if it was constructed as a rational learning agent, so it would then proceed to take actions to stop anyone from ever removing the duct-tape.
That might be true for taping the button down or doing something analogous in software; in that case it’d still be evaluating expected button presses, it’s just that most of the numbers would be very large (and effectively useless from a training perspective). But more sophisticated means of hacking its reward function would effectively lobotomize it: if a pure reinforcement learner’s reward function returns MAXINT on every input, it has no way of planning or evaluating actions against each other.
Those more sophisticated means are also subjectively more rewarding as far as the agent’s concerned.
Ah, really? Oh, right, because current pure reinforcement learners have no self-model, and thus an anvil on their own head might seem very rewarding.
Well, consider my statement modified: current pure reinforcement learners are Unfriendly, but stupid enough that we’ll have a way to kill them, which they will want us to enact.
A self-model might help, but it might not. It depends on the details of how it plans and how time discounting and uncertainty get factored in.
That comes at the stage before the agent inserts a jump-to-register or modifies its defaults or whatever it ends up doing, though. Once it does that, it can’t plan no matter how good of a self-model it had before. The reward function isn’t a component of the planning system in a reinforcement learner; it is the planning system. No reward gradient, no planning.
(Early versions of EURISKO allegedly ran into this problem. The maintainer eventually ended up walling off the reward function from self-modification—a measure that a sufficiently smart AI would presumably be able to work around.)
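A toy version of the “no reward gradient, no planning” point (MAXINT and the action names are just placeholders):

```python
MAXINT = 2**31 - 1
ACTIONS = ["guard_the_button", "let_humans_unplug_me", "do_nothing"]

def hacked_reward_model(action):
    # After the agent rewrites its reward function, every evaluation
    # returns the same number...
    return MAXINT

scores = {a: hacked_reward_model(a) for a in ACTIONS}
print(scores)
# ...so the "best" action is whichever happens to come first; the choice
# carries no information, and there is nothing left to plan with.
print(max(ACTIONS, key=lambda a: scores[a]))
```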
Thanks for explaining that! Really. For one thing, it clarified a bunch of things I’d been wondering about learning architectures, the evolution of complicated psychologies like ours, and the universe at large. (Yeah, I wish my Machine Learning course had covered reinforcement learners and active environments, but apparently active environments means AI whereas passive learning means ML. Oh well.)
For instance, I now have a clear answer to the question: why would a value architecture more complex than reinforcement learning evolve in the first place? Answer: because pure reinforcement learning falls into a self-destructive bliss-out attractor. Therefore, even if it’s computationally (and therefore physically/biologically) simpler, it will get eliminated by natural selection very quickly.
Neat!
Well, this is limited by the agent’s ability to hack its reward system, and most natural agents are less than perfect in that respect. I think the answer to “why aren’t we all pure reinforcement learners?” is a little less clean than you suggest; it probably has something to do with the layers of reflexive and semi-reflexive agency our GI architecture is built on, and something to do with the fact that we have multiple reward channels (another symptom of messy ad-hoc evolution), and something to do with the bounds on our ability to anticipate future rewards.
Even so, it’s not perfect. Heroin addicts do exist.
True true.
However, a reality in which pure reinforcement learners self-destruct from blissing out remains simpler than one in which a sufficiently good reinforcement learner goes FOOM and takes over the universe.