When machine learning people talk about reinforcement, they are usually being not quite literal. In practice, if someone says they study reinforcement learning, they mean they’re studying a class of algorithms that’s loosely clustered around the traditional multi-armed-bandit stuff and preserves a lot of the flavor, but which deviates significantly from the traditional formulation and which they’re hoping will overcome the limitations you describe. Similarly, when people talk about RL in the context of far-future predictions, they are imagining something which shares a few of the premises of today’s RL—a system which takes sequential actions and receives reward—but very little of the other details. This is poor terminology all around, but unfortunately research into “RL-flavored-but-not-literally-RL algorithms” doesn’t have a good ring to it.
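To pin down what I mean by the traditional formulation, here is a minimal epsilon-greedy bandit sketch. The payout probabilities and exploration rate are made-up illustrative numbers; the point is just that the agent’s behavior is driven entirely by an externally supplied scalar reward, which is exactly the crude picture the rest of this comment pushes back on.

```python
import random

# Minimal epsilon-greedy bandit: the "crude RL" picture in which behavior is
# driven entirely by maximizing an externally supplied scalar reward.
# The payout probabilities and exploration rate are illustrative assumptions.
TRUE_PAYOUTS = [0.2, 0.5, 0.8]   # hypothetical per-arm reward probabilities
EPSILON = 0.1                    # exploration rate

estimates = [0.0] * len(TRUE_PAYOUTS)
counts = [0] * len(TRUE_PAYOUTS)

for step in range(10_000):
    # Explore occasionally; otherwise exploit the arm with the best estimate.
    if random.random() < EPSILON:
        arm = random.randrange(len(TRUE_PAYOUTS))
    else:
        arm = max(range(len(TRUE_PAYOUTS)), key=lambda a: estimates[a])

    reward = 1.0 if random.random() < TRUE_PAYOUTS[arm] else 0.0

    # Incremental update of the running mean reward for the chosen arm.
    counts[arm] += 1
    estimates[arm] += (reward - estimates[arm]) / counts[arm]

print(estimates)  # should roughly recover TRUE_PAYOUTS
```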
A system which only shares a few of the features of RL, without sticking strictly to the main tenets of the paradigm, is one that does not have the dangers that these people are talking about—if they are just talking about systems that “generally take account of rewards, in a flexible and subtle way”, then there is no way to postulate those extreme scenarios in which the AI does something utterly bizarre.
If someone posits that a future AI could do something like rewiring its sensors or rewriting its internal definitions in order to maximize a “reward signal”, then this raises the question of what kind of AI the person is assuming. If they assume crude RL, such a bizarre scenario is feasible.
But if, on the other hand, that person is being extremely inclusive about what they mean by “RL”, and they actually mean to encompass systems that “generally take account of rewards, in a flexible and subtle way”, then the scenario is nonsensical. It would be almost trivial to ensure that a system with that more nuanced design was constructed with checks and balances (global modeling of self, and large numbers of weak constraints) that prevent it from doing idiotic things like tampering with its sensors.
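As a toy illustration of the “large numbers of weak constraints” idea (this is my own sketch, not a description of any real system): no single check is decisive, but an action gets vetoed once enough individually crude, independent checks object to it.

```python
# Toy illustration of "many weak constraints": each check is deliberately crude
# and easy to fool on its own; the strength comes from aggregating them.
from typing import Callable, List

Constraint = Callable[[str], float]  # returns a "concern" score in [0, 1]

def weak_constraints() -> List[Constraint]:
    # Hypothetical checks, chosen only to illustrate the aggregation idea.
    return [
        lambda action: 1.0 if "tamper with sensors" in action else 0.0,
        lambda action: 1.0 if "redefine reward" in action else 0.0,
        lambda action: 0.5 if "modify own goal" in action else 0.0,
        lambda action: 0.3 if "hide activity" in action else 0.0,
    ]

def permitted(action: str, constraints: List[Constraint], threshold: float = 1.0) -> bool:
    # The action goes ahead only if the total concern stays below the threshold.
    return sum(c(action) for c in constraints) < threshold

checks = weak_constraints()
print(permitted("fetch the coffee", checks))                          # True
print(permitted("tamper with sensors and redefine reward", checks))   # False
```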
Think of it in human terms. Teenagers don’t want to take out the trash, right? So what if they redefine trash as “the stuff in my bedroom wastebasket”? Then they can just do a really easy task, and say they have satisfied the requirement (they get the reward TASK-COMPLETED and the associated dopamine hit, presumably). But every human who is smart enough to be worth talking about eventually realizes that IF they carry on tampering with definitions in order to comply with such things, they will ultimately be screwed. So they stop doing it. That is an example of a system that “generally takes account of rewards, in a flexible and subtle way”.
But the people who discuss AI safety almost invariably talk in such a way that, actually, their assumed AI is following crude RL principles.