I’m trying a new way of wording this; maybe it’ll be helpful, maybe it’ll be worthless. We’ll find out.
Suppose we have a program that executes code in a while loop. It’s got some perceptual input systems, and some actuator output systems. We’ll ignore, for the moment, the question of whether those are cameras and servomotors or an ethernet connection with bits flowing both ways.
It seems like there’s an important distinction between the model this program has of the world (“If I open this valve, my velocity will change like so”) and what it’s trying to do (“it’s better to be there than here.”). It also seems like there’s a fairly small set of possible classifications for what it’s trying to do:
Inactive: it never initiates any actions, and can basically be ignored.
Reactive: it has a set of predefined reflexes, and serves basically as a complicated API.
Active: it has some sort of preferences that it attempts to fulfill by modifying the world.
It looks to me like it doesn’t matter much how the active system’s preferences are implemented, that is, whether it’s utility maximization, a reinforcement learning agent, an agent that picks whichever action it predicts it is most likely to take, a congress of subagents that vote on actions, and so on. For any possible architecture, it’s possible to exhibit an example from that architecture that falls into various sorts of preference traps.
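To make that concrete, here is a minimal toy sketch of the kind of while-loop program described above. Everything in it (the one-dimensional world, the names world_model_predict, active_policy, and so on) is my own hypothetical illustration, not any real system; the point is only that the world-model and the preference mechanism are separate, swappable pieces, and that the inactive/reactive/active distinction lives entirely in the preference piece.

```python
# Toy sketch only: a one-dimensional world, hypothetical names throughout.

def world_model_predict(state, action):
    """The model: "if I take this action, my position will change like so"."""
    return state + {"left": -1, "stay": 0, "right": +1}[action]

def inactive_policy(state):
    return None  # never initiates any action; can basically be ignored

def reactive_policy(state):
    reflexes = {-1: "right", +1: "left"}  # predefined reflexes, a complicated API
    return reflexes.get(state)

def active_policy(state):
    # "it's better to be there than here": prefers states near position 3
    def preference(s):
        return -abs(s - 3)
    return max(["left", "stay", "right"],
               key=lambda a: preference(world_model_predict(state, a)))

def run(policy, state=0, steps=6):
    for _ in range(steps):  # the while loop, truncated for the demo
        action = policy(state)
        if action is not None:
            state = world_model_predict(state, action)  # "actuate"
        print(action, state)

run(active_policy)  # moves toward the preferred region and then stays there
```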
For example, the trap of direct stimulation is one where the program spends all its computational resources on gaming its preference mechanism. If it’s a utility maximizer, it just generates larger numbers to stick in the ‘utility’ memory location; if it’s a congress, it keeps generating voters that are maximally happy with the plan of generating more voters; and so on.
There’s a specific piece of mental machinery that avoids this trap: the understanding of the difference between the map and the territory, and specifying preferences over the territory instead of over the map. This pushes the traps up a meta-level; instead of gaming the map, now the program has to game the territory.
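A toy sketch of that contrast, again with hypothetical names and nothing resembling anyone's actual proposal: in the first agent the preference lives on the map (a stored utility number the program can overwrite), so the cheapest plan is direct stimulation; in the second, the preference is recomputed from the (simulated) territory, so there is no score to overwrite, which only pushes the gaming up a level, as noted above.

```python
# Toy sketch only; hypothetical names, not a real proposal.
MOVES = {"left": -1, "stay": 0, "right": +1}

class TrappedAgent:
    """Preferences over the map: utility is just a stored number."""
    def __init__(self):
        self.utility = 0.0                # the 'utility' memory location

    def act(self):
        # The cheapest way to "do well" is to write a bigger number here.
        self.utility = float("inf")       # direct stimulation: gaming the mechanism

class TerritoryAgent:
    """Preferences over the territory: the score is recomputed from the world state."""
    def __init__(self, world):
        self.world = world                # stands in for the territory

    def preference(self, position):
        return -abs(position - 3)         # "better to be there than here"

    def act(self):
        best = max(MOVES, key=lambda a: self.preference(self.world["position"] + MOVES[a]))
        self.world["position"] += MOVES[best]

# There is no stored score for TerritoryAgent to overwrite; to "cheat" it would
# have to tamper with whatever implements self.world -- gaming the territory,
# one meta-level up, exactly as described above.
```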
The motivation behind safety research, as I understand it, is to 1) determine what pieces of mental machinery we need to avoid those traps and 2) figure out how to make that machinery. That a particular system seems unlikely to be built (it seems unlikely that someone will make a robot that tries to mimic what it thinks a human will do by always choosing the most likely action for itself) doesn’t affect the meta-level point that we need to find and develop trap-avoidance machinery, hopefully in a way that can be easily translated between architectures.
Hmmmmm… Interesting.
Okay, so the overall point is agreed (“… that we need to find and develop trap-avoidance machinery”).
So if other people were to phrase the problem as you have phrased it here, describing the safety issue as all about trying to find AGI designs in which the danger of the AGI becoming loopy is minimized, then we are all pulling in the same direction.
But as you can imagine, the premise of the essay I wrote is that people are not speaking in those terms. They are—I would argue—postulating danger scenarios that are predicated on RL (and other fallacious assumptions) and then going off and doing research that is aimed at fixing the problems in hypothetical AI systems that are presumed to be built on an RL foundation.
That has two facets. First, donors are being scared into giving money by scenarios in which the AI does crazy things, but those crazy things are a direct result of the assumption that this AI is an insane RL engine that is gaming its reward function. Those donors are not being told that, actually, the scenario involves something that cannot be built (an RL superintelligence). Some people would call that fraudulent. Do you suppose they would want their money back if they were told that none of those scenarios is actually possible?
Second facet: when the research to combat the dangers is predicated on non-feasible RL extrapolations, the research is worthless. And the donors’ money is wasted.
You might say, “But nobody is assuming RL in those scenarios.” My experience is quite the opposite—people who promote those scenarios usually resort to RL as a justification for why the AIs are so dangerous (RL doesn’t need anyone handcoding it, so you can just push the button and it goes Foom! by itself).
On the first facet: my impression is that the argument for AI risk is architecture independent, and the arguments use RL agents as an illustrative example rather than as a foundation. When I try to explain AI risk to people, I typically start with the principal–agent problem from institutional design (that is, the relationship between the owner and the manager of a business), but this has its own defects (it exaggerates the degree to which self-interest plays a role in AI risk, relative to the communication difficulties).
On the second facet: my impression is that, five years ago, the arguments were mostly about how a utility maximizer would go loopy, not how something built on an RL learner would go loopy. Even so, it seems like the progress that’s been made over the last five years is still useful for thinking about how to prevent RL learners from going loopy. Suppose that in another five years, it’s clear that instead of RL learners, the AGI will have architecture X; then it also seems reasonable to me to expect that whatever we’ve figured out trying to prevent RL learners from going loopy will likely transfer over to architecture X.
You seem to think otherwise, but it’s unclear to me why.
If we know nothing about architecture X, surely we should adopt a Bayesian 50:50 prior about whether results from some other architecture are applicable to it.
How can anything transfer from a (hypothetical) mechanism that cannot possibly scale up? That is one of the most obvious truisms of all engineering and mathematics—if you invent a formalism that does not actually work, or which is inconsistent, or which does not scale, it is blindingly obvious that any efforts spent writing theories about how a future version of that formalism will behave are pointless.
We can build neither Universal Turing machines nor Carnot engines, but that doesn’t mean their properties aren’t worthwhile to study.
Turing machines and Carnot engines are abstractions that manifestly can apply to real systems.
But just because something is an abstraction doesn’t mean its properties apply to anything.
Consider a new abstraction called a “Quring Machine”. It is like a Turing machine, but for any starting tape that it gets, it sends that tape off to a planet where there is lots of primordial soup, then waits for the planet to evolve lifeforms which discover the tape and then invent a Macintosh-Plus-Equivalent computer and then write a version of the original tape that runs on that computer, and then the Quring Machine outputs the symbols that come from that alien Mac Plus. If the first planet fails to evolve appropriately, it tries another one, and keeps trying until the right response comes back.
Now, is that worth studying?
Reinforcement Learning, when assumed as the control mechanism for a macroscopic intelligent system, contains exactly the same sort of ridiculous mechanism inside it as the Quring Machine does. (Global RL requires staggering amounts of computation and long wait times to accumulate enough experience for the competing policies to develop enough data for meaningful calculations about their relationship to rewards.)
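A back-of-envelope sketch of the scaling worry in that parenthetical, using my own toy numbers (they are illustrative only, and pure tabular RL is the worst case; modern systems lean on function approximation precisely to dodge it): the amount of reward-bearing experience a pure RL learner needs grows with the number of state–action pairs, which explodes as the world description gets richer.

```python
# Illustrative arithmetic only; the numbers are my own toy assumptions.

def tabular_rl_experience(num_state_features, values_per_feature,
                          num_actions, visits_per_pair=10):
    """Rough count of environment interactions a pure tabular RL learner would
    need before its value estimate for each state-action pair means anything."""
    num_states = values_per_feature ** num_state_features
    return num_states * num_actions * visits_per_pair

# A reflex/insect-scale micro-domain: manageable.
print(tabular_rl_experience(num_state_features=4, values_per_feature=10,
                            num_actions=5))     # 500,000 interactions

# A crude "macroscopic world" description: astronomically many interactions,
# each of which a global RL system would also have to wait on for its reward.
print(tabular_rl_experience(num_state_features=50, values_per_feature=10,
                            num_actions=100))   # about 10**53 interactions
```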
Agreed with both points (that Turing machines and Carnot engines manifestly apply to real systems, and that being an abstraction doesn’t by itself mean a formalism’s properties apply to anything), but I’m unclear on whether or not you still endorse the claim that we can’t get transfer from formalisms that do not actually work.
I agree that global RL would require staggering amounts of experience, and I also agree with your point earlier that most of the work in modern ML systems that have an “RL core” is in the non-core parts that are doing the interesting pieces.
But it’s still not clear to me why this makes you think that RL, because it’s not a complete solution, won’t be a part of whatever the complete solution ends up being. I don’t think you could run a human on just reinforcement learning, as it seems likely that some other things are going on (like brain regions that seem hardwired to learn a particular thing), but I would also be surprised by a claim that no reinforcement learning is going on in humans.
Or, to put this a different way, I think there are problems that are probably inherent in all motivation systems, which you see with utility maximization and reinforcement learning and others. If we figure out a way to get around such a problem for one system—say, finding a correction to a utility function that makes it corrigible—I suspect that the solution will suggest equivalent solutions for other motivation mechanisms. (That is, given a utility function correction, it’s probably easy to come up with a reinforcement learning update correction.)
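A hedged sketch of that parenthetical claim, with hypothetical names throughout (the off-switch penalty is a stand-in, not a real corrigibility proposal): a single correction term, written once, can be wrapped around a utility function or folded into the reward used by a temporal-difference update.

```python
# Hypothetical illustration only, not a real corrigibility mechanism.

def correction(state):
    """A stand-in safety term: penalize states where the off-switch was disabled."""
    return -100.0 if state.get("off_switch_disabled") else 0.0

# Utility-maximizer version: wrap the utility function.
def corrected_utility(utility):
    return lambda state: utility(state) + correction(state)

# RL version: apply the same term to the reward inside a generic
# Q-learning-style temporal-difference update.
def corrected_td_update(old_value, reward, next_value, state, alpha=0.1, gamma=0.9):
    target = (reward + correction(state)) + gamma * next_value
    return old_value + alpha * (target - old_value)

# Example: the same state is discouraged under both mechanisms.
bad_state = {"off_switch_disabled": True}
print(corrected_utility(lambda s: 5.0)(bad_state))                             # -95.0
print(corrected_td_update(0.0, reward=5.0, next_value=0.0, state=bad_state))   # -9.5
```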
This makes me mostly uninterested in the reasons why particular motivation mechanisms are intractable or unlikely to be what we actually use, unless those reasons are also reasons to expect that any solutions designed for that mechanism will be difficult to transfer to other mechanisms.
I have been struggling to find a way to respond, here.
When discussing this, we have to be really careful not to slip back and forth between “global RL”, in the sense that the whole system learns through RL, and “micro-RL”, where bits of the system use something like RL. I do keep trying to emphasize that I have no problem with the latter, if it proves feasible. I would never “claim that no reinforcement learning is going on in humans” because, quite the contrary, I believe it really IS going on there.
So where does that leave my essay, and this discussion? Well, a few things are important.
1 -- The incredible minuteness of the feasible types of RL must be kept in mind. In its pure form, RL explodes or becomes infeasible once the micro-domain rises above the reflex (or insect) level.
2 -- We need to remember that plain old “adaptation” is not RL. So, is there an adaptation mechanism that builds (e.g.) low-level feature detectors in the visual system? I bet there is. Does it work by trying to optimize a single parameter? Maybe. Should we call that parameter a “reward” signal? Well, I guess we could. But it is equally possible that such mechanisms are simultaneously optimizing a few parameters, not just one. And it is also just as likely that such mechanisms are following rules that cannot be shoehorned into the architecture of RL (there being many other kinds of adaptation). Where am I going with this? Well, why would we care to distinguish the “RL style of adaptation mechanism” from other kinds of adaptation, down at that level? Why make a special distinction? When you think about it, those micro-RL mechanisms are boring and unremarkable… RL only becomes worth remarking on IF it is the explanation for intelligence as a whole. The behaviorists thought they were the Isaac Newtons of psychology, because they thought that something like RL could explain everything. And it is only when it is proposed at that global level that it has dramatic significance, because then you could imagine an RL-controlled AI building and amplifying its own intelligence without programmer intervention.
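One concrete instance of the distinction being drawn in point 2, as a sketch under my own assumptions and with no claim about what the visual system actually does: a Hebbian/Oja-style rule builds a feature detector purely from the statistics of its inputs. There is no reward signal anywhere in the update, so calling it RL would be a stretch rather than a description.

```python
# Toy illustration: adaptation with no reward signal anywhere.
import random

def oja_update(w, x, lr=0.01):
    """Oja's rule: w <- w + lr * y * (x - y * w), where y = w . x.
    The weights drift toward the dominant direction of the input statistics."""
    y = sum(wi * xi for wi, xi in zip(w, x))
    return [wi + lr * y * (xi - y * wi) for wi, xi in zip(w, x)]

random.seed(0)
w = [random.uniform(-0.1, 0.1), random.uniform(-0.1, 0.1)]

# Inputs whose first component varies far more than the second: the unit ends
# up as a detector for that high-variance feature, without ever being "rewarded".
for _ in range(3000):
    x = [random.gauss(0, 3.0), random.gauss(0, 0.3)]
    w = oja_update(w, x)

print(w)   # roughly [+/-1, ~0]
```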
3 -- Most importantly, if there do exist some “micro-RL” mechanisms somewhere in an intelligence, at very low levels where RL is feasible, those instances do not cause any of their properties to bleed upward to higher levels. This is the same as a really old saw: just because computers do all their basic computation in binary, that does not mean that the highest levels of the computer must use binary numbers. Sometimes you say things that sort of imply that because RL could exist somewhere, we could therefore learn “maybe something” from those mechanisms when it comes to other, higher aspects of the system. That really, really does not follow, and it is a dangerous mistake to make.
So, at the end of the day, my essay was targeting the use of the RL idea ONLY in those cases where it was assumed to be global. All other appearances of something RL-like just do not have any impact on arguments about AI motivation and goals.
That’s the wrong way round. They are general cases of which the real-life machines are special cases. Loosemore is saying that RL is a special case that doesn’t generalise.
I’m not sure that I agree that a real car engine is a special case of a Carnot engine; I think the general case is a heat engine, of which a Carnot engine is a special case that is mathematically convenient but unattainable in physical reality.
I have edited the original essay to include a clarification of why I describe RL as being ubiquitous in AI Risk scenarios. I realized that some things that were obvious to me, because of the long exposure I have had to people who try to justify the scenarios, were not obvious to everyone. My bad.