For starters, suppose the AI straps lots of humans into beds, giving them endless morphine and heroin IV drips, and the humans get into such a state of delirium that they repeatedly praise and thank the AI for continuing to keep the heroin drip turned on.
This dystopian situation would be, to the AI, absolute ecstasy—much like the heroin to those poor humans.
This seems to require some pretty important normative claims that are controversial in EA and the rationality community. Based on your description, the humans come to approve of this (desire/prefer this) more than they approved of their lives before (or we could imagine similar scenarios where this is the case), and they gain more pleasure from it, and we can assume their approval outweighs the violation of their prior preferences for this not to happen. So, if you’re a welfarist consequentialist and a hedonist or a desire/preference theorist, and unless an individual’s future preferences count much less, this just seems better for those humans than what normal life has been like lately.
Some ways out seem to be:
1. Maybe certain preference-affecting views, or discounting future preferences, or antifrustrationism (basically negative preference utilitarianism), or something in these directions
2. Counting preferences more or less based on their specific contents, e.g. wanting to take heroin is a preference that counts less in your calculus
3. Non-hedonist and non-preferential/desire-based welfare (possibly in addition to hedonistic and preferential/desire-based welfare), e.g. objective goods/bads
4. Non-welfarist consequentialist values, i.e. valuing outcomes for reasons other than how they matter to individuals’ welfare
5. Non-consequentialism, e.g. constraints on violating preferences or consent, or on not getting affirmative consent
6. Actually, whatever preferences they had before and were violated were stronger than the ones they have now and are satisfied
7. Actually, we should think much bigger; maybe we should optimize with artificial consciousness instead and do that a lot more.
If it’s something like 1 or 5, it should instead (or also?) model what the humans already want, and try to get that to happen.
Let’s please not have an object-level debate about hedonic utilitarianism (or whatever) here. It’s very off-topic.
I personally am strongly against strapping humans into beds on heroin drips. If anyone reading this is in favor of the heroin-drip thing (which seems unlikely to me, but what do I know?), then the bad news for them is that there’s no strong reason to expect that outcome from APTAMI either.
My discussion in the OP was not a prediction of what APTAMI will lead to; it was an illustrative example of one of a near-infinite number of weird edge cases in which its classifiers would give undesired results. In particular, I think hedonic utilitarians would also sorely regret turning on an APTAMI system as described. There is no moral philosophy anyone subscribes to such that they would have a strong reason to believe that turning on APTAMI would not be a catastrophic mistake by their own lights.
I think your scenario only illustrates a problem with outer alignment (picking the right objective function), and I think it’s possible to state an objective that, if it could be implemented sufficiently accurately and we could guarantee the AI followed it (inner alignment), would not result in a dystopia like this. If you think the model would do well on inner alignment if we fixed the problems with outer alignment, then it seems like a very promising direction, and this would be worth pointing out and emphasizing.
I think the right direction is modelling how humans now (at the time before the action is taken), without coercion or manipulation, would judge the future outcome if properly informed about its contents, especially how humans and other moral patients are affected (including what happened along the way, e.g. violations of consent and killing). I don’t think you need the coherency of coherent extrapolated volition: you can still capture people finding this future substantially worse, by some important lights, than some possible futures, including the ones where the AI doesn’t intervene, and just set a difference-making, ambiguity-averse objective. Or, maybe require it not to do substantially worse by any important light, if that’s feasible: we would allow the model to flexibly represent the lights by which humans judge outcomes, where doing badly by one means doing badly overall. Then it would be incentivized to focus on acts that seem robustly better to humans.
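As a rough illustration of that last idea (my own sketch, not something proposed in the original discussion), the “not substantially worse by any important light” criterion could be scored roughly as below; the particular lights, the `judge` function, and the tolerance `EPSILON` are all hypothetical placeholders:

```python
# Hypothetical sketch of a "not substantially worse by any important light" objective.
# LIGHTS, judge(), and EPSILON are illustrative assumptions, not part of any existing proposal.

LIGHTS = ["informed_preferences", "consent", "suffering", "autonomy"]
EPSILON = 0.05  # tolerance for "not substantially worse"

def judge(outcome, light):
    """Predicted judgement of `outcome` by current, informed, uncoerced humans
    under one evaluative light (higher = better). A learned model would go here."""
    raise NotImplementedError

def robust_score(action_outcome, no_intervention_outcome):
    """Score a candidate action against the no-intervention baseline."""
    diffs = [judge(action_outcome, light) - judge(no_intervention_outcome, light)
             for light in LIGHTS]
    worst = min(diffs)
    if worst < -EPSILON:
        return float("-inf")  # veto: substantially worse by at least one light
    return worst  # otherwise prefer the action with the best worst-case difference
```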
I think an AI that actually followed such an objective properly would not, by the lights of the humans whose judgements it’s predicting, increase the risk of dystopia through its actions (although building it may be s-risky, given the risk of minimization). Maybe it would slow moral progress and lock in the status quo, though. If the AI is smart enough, it can understand “how humans now (at the time before taking the action), without coercion or manipulation, would judge the future outcome if properly informed about its contents”, but it can still be hard to point the AI at that even if it does understand it.
Another approach I can imagine is to split the rewards into periods, discount them temporally, check for approval and disapproval signals in each period, and make it very costly relative to anything else to miss even one approval or to receive a disapproval. I describe this more here and here. As JBlack pointed out in the comments on the second post, there’s an incentive to hack the signal. However, as long as attempts to do so are risky enough by the lights of the AI, and the AI is sufficiently averse to losing approval or getting disapproval, and the risk of either is high enough, it wouldn’t do it. And of course, there’s still the problem of inner alignment: maybe it doesn’t even end up caring about approval/disapproval in the way our objective says it should out of distribution.
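A toy sketch of that reward structure, in case it helps make the idea concrete; this is my own illustration rather than anything from the linked posts, and the discount factor, penalty size, and `Period` fields are assumptions:

```python
# Toy sketch of per-period rewards with temporal discounting, where a missed approval
# or a received disapproval in any period is made very costly relative to everything else.
# GAMMA, MISS_PENALTY, and the Period fields are assumptions, not from the linked posts.

from dataclasses import dataclass

GAMMA = 0.99          # per-period temporal discount factor
MISS_PENALTY = -1e6   # dwarfs any achievable task reward

@dataclass
class Period:
    task_reward: float  # ordinary reward accumulated during this period
    approved: bool      # did the overseer give an approval signal this period?
    disapproved: bool   # did the overseer give a disapproval signal this period?

def total_return(periods):
    total = 0.0
    for t, p in enumerate(periods):
        r = p.task_reward
        if p.disapproved or not p.approved:
            r += MISS_PENALTY  # one missed approval or one disapproval outweighs everything else
        total += (GAMMA ** t) * r
    return total

# Example: two approved periods, then one with a disapproval dominating the total.
# total_return([Period(1.0, True, False), Period(1.0, True, False), Period(5.0, True, True)])
```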
I’m really making two claims in this post:
First, that the vague proposal for Intrinsic Cost in APTAMI will almost definitely lead to AIs that want to kill us.
Second, that nobody has a better proposal for Intrinsic Cost that’s good enough that we would have a strong reason to believe that the AI won’t want to kill us.
Somewhere in between those two claims is a question of whether it’s possible to edit the APTAMI proposal so that it’s less bad—even if it doesn’t meet the higher bar of “strong reason to believe it won’t want to kill us”. My answer is “absolutely yes”. The APTAMI proposal is so bad that I find it quite easy to think of ways to make it less bad. The thing you mentioned (i.e. that “when perceiving joy in nearby humans” is a poorly-thought-through phrase and we can do better) is indeed one example.
My main research topic is (more-or-less) how to write an Intrinsic Cost module, sometimes directly and sometimes indirectly. I don’t currently have any plan that passes the bar of “strong reason to believe the AI won’t want to kill us”. I do have proposals that seem to pass the much lower bar of “it seems at least possible that the AI won’t want to kill us”—see here for a self-contained example. The inner alignment problem is definitely relevant.
I believe Steven didn’t imply that a significant number of people would approve of or want such a future; indeed, the opposite, which is why he called the scenario “dystopian”.
He basically meant that optimising surface signals of pleasure does not automatically lead to behaviours and plans congruent with reasonable ethics, so the surface elements of alignment suggested by LeCun in the paper are clearly insufficient.
I think many EAs/rationalists shouldn’t find this to be worse for humans than life today on the views they apparently endorse, because each human looks better off under standard approaches to intrapersonal aggregation: they get more pleasure, less suffering, more preference satisfaction (or we can imagine some kind of manipulation to achieve this), but at the cost of some important frustrated preferences.
The EA/rationalist community seems to me the one that gives the most conscious attention to the problem of following your officially stated preferences down a cliff, a.k.a. Goodhart.
FWIW, I’m not sure if that’s true relative to the average person, and I’d guess non-consequentialist philosophers are more averse to biting bullets than the average EA and maybe rationalist.
I suspect that you read “conscious attention to the problem” in a different light than what I mean. To clarify:
The average person won’t go down a cliff on literal instructions, unless Moloch? Yes.
The average person will identify and understand such a problem? No.
EA bites a bullet and does something weird? Yes.
EA bites a bullet because YAY BULLETS COOL? No.