Okay, so the overall point is agreed (“… that we need to find and develop trap-avoidance machinery”).
So if other people were to phrase the problem as you have phrased it here, describing the safety issue as all about trying to find AGI designs in which the danger of the AGI becoming loopy is minimized, then we are all pulling in the same direction.
But as you can imagine, the premise of the essay I wrote is that people are not speaking in those terms. They are—I would argue—postulating danger scenarios that are predicated on RL (and other fallacious assumptions) and then going off and doing research that is aimed at fixing the problems in hypothetical AI systems that are presumed to be built on an RL foundation.
That has two facets. First, donors are being scared into giving money by scenarios in which the AI does crazy things, but those crazy things are a direct result of the assumption that this AI is an insane RL engine that is gaming its reward function. Those donors are not being told that, actually, the scenario involves something that cannot be built (an RL superintelligence). Some people would call that fraudulent. Do you suppose they would want their money back if they were told that none of those scenarios is actually possible?
Second facet: when the research to combat the dangers is predicated on non-feasible RL extrapolations, the research is worthless. And the donors’ money is wasted.
You might say “But nobody is assuming RL in those scenarios”. My experience is quite the opposite: people who promote those scenarios usually resort to RL as a justification for why the AIs are so dangerous (RL doesn’t need anyone hand-coding it, so you can just push the button and it goes Foom! by itself).
“First, donors are being scared into giving money by scenarios in which the AI does crazy things, but those crazy things are a direct result of the assumption that this AI is an insane RL engine that is gaming its reward function.”
My impression is that the argument for AI risk is architecture-independent, and the arguments use RL agents as an illustrative example instead of as a foundation. When I try to explain AI risk to people, I typically start with the principal-agent problem from institutional design (that is, the relationship between the owner and the manager of a business), but this has its own defects (it exaggerates the degree to which self-interest plays a role in AI risk, relative to the communication difficulties).
“Second facet: when the research to combat the dangers is predicated on non-feasible RL extrapolations, the research is worthless. And the donors’ money is wasted.”
My impression is that, five years ago, the arguments were mostly about how a utility maximizer would go loopy, not how something built on an RL learner would go loopy. Even so, it seems like the progress that’s been made over the last five years is still useful for thinking about how to prevent RL learners from going loopy. Suppose that in another five years, it’s clear that instead of RL learners, the AGI will have architecture X; then it also seems reasonable to me to expect that whatever we’ve figured out trying to prevent RL learners from going loopy will likely transfer over to architecture X.
You seem to think otherwise, but it’s unclear to me why.
How can anything transfer from a (hypothetical) mechanism that cannot possibly scale up? That is one of the most obvious truisms of all engineering and mathematics: if you invent a formalism that does not actually work, or which is inconsistent, or which does not scale, it is blindingly obvious that any effort spent writing theories about how a future version of that formalism will behave is pointless.
Turing machines and Carnot engines are abstractions that manifestly can apply to real systems.
But just because something is an abstraction doesn’t mean its properties apply to anything.
Consider a new abstraction called a “Quring Machine”. It is like a Turing machine, but for any starting tape that it gets, it sends that tape off to a planet where there is lots of primordial soup, then waits for the planet to evolve lifeforms which discover the tape and then invent a Macintosh-Plus-Equivalent computer and then write a version of the original tape that runs on that computer, and then the Quring Machine outputs the symbols that come from that alien Mac Plus. If the first planet fails to evolve appropriately, it tries another one, and keeps trying until the right response comes back.
Now, is that worth studying?
Reinforcement Learning, when assumed as a control mechanism for a macroscopic intelligent system, contains exactly the same sort of ridiculous mechanism as the Quring Machine. (Global RL requires staggering amounts of computation and long wait times, to accumulate enough experience for the competing policies to develop enough data for meaningful calculations about their relationship to rewards.)
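A rough way to picture the scaling worry is to count how many deterministic policies a naive, brute-force search would have to evaluate in a small tabular problem. This is only a sketch under stated assumptions: the tabular setting, the function names, and the numbers are invented for illustration, and practical RL algorithms do not enumerate policies one by one, so it shows only why unstructured search over policies blows up, not how any real system behaves.

```python
# Illustrative sketch only. The tabular setting, the function names, and the
# numbers below are assumptions made for this example, not part of the thread.


def num_deterministic_policies(n_states: int, n_actions: int) -> int:
    """Count the distinct state -> action mappings in a tabular problem."""
    return n_actions ** n_states


def episodes_for_exhaustive_search(n_states: int, n_actions: int,
                                   episodes_per_policy: int) -> int:
    """Episodes needed if every policy were evaluated separately."""
    return num_deterministic_policies(n_states, n_actions) * episodes_per_policy


if __name__ == "__main__":
    # The count grows exponentially in the number of states.
    for n_states in (4, 16, 64):
        total = episodes_for_exhaustive_search(n_states, n_actions=4,
                                               episodes_per_policy=100)
        print(f"{n_states} states, 4 actions: {total} episodes")
```

The count is exponential in the number of states, which is the crude intuition behind the claim that experience requirements explode once the domain grows beyond something very small.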
“Turing machines and Carnot engines are abstractions that manifestly can apply to real systems.
But just because something is an abstraction doesn’t mean its properties apply to anything.”
Agreed with both points, but I’m unclear on whether or not you still endorse the claim that we can’t get transfer from formalisms that do not actually work.
“Global RL requires staggering amounts of computation and long wait times, to accumulate enough experience for the competing policies to develop enough data for meaningful calculations about their relationship to rewards”
I agree with this, and I also agree with your point earlier that most of the work in modern ML systems that have an “RL core” is done in the non-core parts, which handle the interesting pieces.
But it’s still not clear to me why this makes you think that RL, because it’s not a complete solution, won’t be a part of whatever the complete solution ends up being. I don’t think you could run a human on just reinforcement learning, as it seems likely that some other things are going on (like brain regions that seem hardwired to learn a particular thing), but I would also be surprised by a claim that no reinforcement learning is going on in humans.
Or maybe to put this a different way, I think there are problems probably inherent in all motivation systems, which you see with utility maximization and reinforcement learning and others. If we figure out a way to get around that problem with one system—say, finding a correction to a utility function that makes it corrigible—I also suspect that the solution will suggest equivalent solutions for other motivation mechanisms. (That is, given a utility function correction, it’s probably easy to come up with a reinforcement learning update correction.)
This makes me mostly uninterested in the reasons why particular motivation mechanisms are intractable or unlikely to be what we actually use, unless those reasons are also reasons to expect that any solutions designed for that mechanism will be difficult to transfer to other mechanisms.
I have been struggling to find a way to respond, here.
When discussing this, we have to be really careful not to slip back and forth between “global RL”, in the sense that the whole system learns through RL, and “micro-RL”, where bits of the system use something like RL. I do keep trying to emphasize that I have no problem with the latter, if it proves feasible. I would never “claim that no reinforcement learning is going on in humans” because, quite the contrary, I believe it really IS going on there.
So where does that leave my essay, and this discussion? Well, a few things are important.
1 -- The incredible minuteness of the feasible types of RL must be kept in mind. In pure form, it explodes or becomes infeasible if the micro-domain gets above the reflex (or insect) level.
2 -- We need to remember that plain old “adaptation” is not RL. So, is there an adaptation mechanism that builds (e.g.) low-level feature detectors in the visual system? I bet there is. Does it work by trying to optimize a single parameter? Maybe. Should we call that parameter a “reward” signal? Well, I guess we could. But it is equally possible that such mechanisms are simultaneously optimizing a few parameters, not just one. And it is also just as likely that such mechanisms are following rules that cannot be shoehorned into the architecture of RL (there being many other kinds of adaptation). Where am I going with this? Well, why would we care to distinguish the “RL style of adaptation mechanism” from other kinds of adaptation, down at that level? Why make a special distinction? When you think about it, those micro-RL mechanisms are boring and unremarkable … RL only becomes worth remarking on IF it is the explanation for intelligence as a whole. The behaviorists thought they were the Isaac Newtons of psychology, because they thought that something like RL could explain everything. And it is only when it is proposed at that global level that it has dramatic significance, because then you could imagine an RL-controlled AI building and amplifying its own intelligence without programmer intervention.
3 -- Most importantly, if there do exist some “micro-RL” mechanisms somewhere in an intelligence, at very low levels where RL is feasible, those instances do not cause any of their properties to bleed upward to higher levels. This is the same as a really old saw … that, just because computers do all their basic computation in binary, that does not mean that the highest levels of the computer must use binary numbers. Sometimes you say things that sort of imply that because RL could exist somewhere, therefore we could learn “maybe something” from those mechanisms, when it comes to other, higher aspects of the system. That really, really does not follow, and it is a dangerous mistake to make.
So, at the end of the day, my essay was targeting the use of the RL idea ONLY in those cases where it was assumed to be global. All other appearances of something RL-like just do not have any impact on arguments about AI motivation and goals.
That’s the wrong way round. They are general cases of which the real-life machines are special cases.
Loosemore is saying that RL is a special case that doesn’t generalise.
I’m not sure that I agree that a real car engine is a special case of a Carnot engine; I think the general case is a heat engine, of which a Carnot engine is a special case that is mathematically convenient but unattainable in physical reality.
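For reference, the standard textbook relation behind this point can be written down directly. The symbols are not from the thread: $T_h$ and $T_c$ are assumed here to denote the absolute temperatures of the hot and cold reservoirs, and $\eta$ the fraction of heat converted to work.

```latex
\eta_{\text{Carnot}} = 1 - \frac{T_c}{T_h},
\qquad
\eta_{\text{real}} < \eta_{\text{Carnot}}
```

For example, with $T_h = 600\,\mathrm{K}$ and $T_c = 300\,\mathrm{K}$ the bound is $\eta_{\text{Carnot}} = 0.5$, and any real heat engine running between the same two reservoirs falls below it. On this reading the Carnot engine is the idealized limiting member of the family of heat engines, not the general case.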
Hmmmmm… Interesting.
If we know nothing about architecture X, surely we should adopt a Bayesian 50:50 about whether some other architecture is applicable to it.
We can build neither Universal Turing machines nor Carnot engines, but that doesn’t mean their properties aren’t worthwhile to study.