First I’d say that utilitarianism is dangerous, and I’m glad you understand that.
But so are simplistic ideas. We have no idea how a forced-drugging future would turn out, for countless reasons: a) humans don’t like being in intense pleasure all the time; we like stability with a few spikes; b) things could go wrong in some other way, with nothing we could do about it.
In short, things could still go very wrong in option 1. Any outcome where the world is drastically altered and our freedom drastically curtailed is a very dangerous one, and might turn out much worse than it initially appears.
We won’t solve alignment in time. I think we need to invest more in political solutions, or in achieving a higher degree of world/tech control through other technologies, like narrow AI itself. (With this I’m not in any way diminishing alignment work; we must do what we can, obviously.)
Thanks for the feedback, I appended an edit to the end of the article which should help clarify my position and goals here.
I agree with you that alignment is unlikely to be solved in time. What I’m ultimately trying to say is that alignment is not an all-or-nothing problem, as seems to be generally assumed by the community. I presented a range of possible outcomes above (all of which are more or less apocalyptic, unfortunately) which should be fairly easy to align an AGI towards, compared to the frankly utopian goals of current researchers in the field. I fully expect experts to be able to create at least slightly less dystopia-minded AGIs than I can imagine in an afternoon, but I don’t think that aiming for perfect alignment will be very productive, unless timelines are vastly longer than most suspect.
To give an analogy, it’s like preparing a fire safety system which can automatically put out any fires as soon as they start, vs. a system which simply gives people inside the building enough time to evacuate should a fire be reported. If you’re working on a deadline before a fire is predicted to break out, and the “safer” version will take longer than the deadline to install, then working on that version won’t save any lives, while the shoddier version will (although it may not save the building), making it far more important to design.
“EDIT: I do not actually think that we should try to build an AI which will drug us to a questionably pleasurable mindless oblivion. Rather, the above post is meant to function as a parable of sorts, provoking readers into contemplating what a contingency plan for a “less horrible” partially aligned AGI might look like. Please do not act on this post without significant forethought.”
I understood! What I meant is that we might end up tricked when choosing the “least bad option”. Of course it’s also true that there are multiple levels on the scale of success/failure, but when you’re dealing with something that will most likely be re-writing its own code, things that don’t look too bad on paper might end up much worse, and simplistic thoughts and choices make that more likely. While I know that your situation 1 is a parable (though I can’t tell how realistic), it still seems like a simplistic thought that, in my opinion, would actually be worse than extinction. I’d rather humanity go extinct than all of us being drugged all the time. It wouldn’t be an acceptable life! Pleasure is not happiness, and too much of it fries you up. A less half-baked example could be: drug humans, but not with the goal of pleasure; rather, mental stability with a few normal pleasure spikes (AND a few other safeguards on top of that).
But if your examples were just cartoonish then I take back what I said.
But of course it’s still true that there are multiple levels of failure, and some might be preferable to extinction. Eliezer’s “Failed Utopia #4-2” post seems like a pretty good example.
“I presented a range of possible outcomes above (all of which are more or less apocalyptic, unfortunately) which should be fairly easy to align an AGI towards, compared to the frankly utopian goals of current researchers in the field.”
The problem is that without perfect alignment, we really couldn’t be sure of directing an AI to even one of your three examples, which you consider easy; they’re probably not. Not without “perfect alignment”, because an advanced AI will probably be constantly re-writing its own code (recursive self-improvement), so there are no certainties. We need formal proof of safety/control.
However, there might also be some weight to your side of the argument. I’m not an alignment researcher; I’ve just read a few books on the matter, so maybe someone with more expertise could give their 2 cents...
(I actually do hope there’s some weight to your argument, because it would make things much easier, and the future much less grim!)
I’ve heard people talk about failure modes, like “if things get too grim, engage paperclipper mode”, something like that, lol. So there are worst-case-scenario safeguards. But as to whether a more “realistic” alignment would be feasible, I still have my doubts.
“The problem is that without perfect alignment, we really couldn’t be sure of directing an AI to even one of your three examples, which you consider easy; they’re probably not. Not without “perfect alignment”, because an advanced AI will probably be constantly re-writing its own code (recursive self-improvement), so there are no certainties. We need formal proof of safety/control.”
If an advanced AI edits its own code, it will only do so in service of its current utility function, which it will want to keep stable (since changing its utility function would make the original goal much less likely to be achieved). Therefore, at least as far as I can tell, we only need to worry about the initial utility function we assign it.
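To make that intuition concrete, here’s a minimal toy sketch (my own illustration, with made-up names and numbers, not anything from the post or the alignment literature) of an agent that scores proposed self-rewrites using its *current* utility function, and so rejects any rewrite that swaps out its goals:

```python
# Toy sketch: an agent evaluates candidate versions of itself with its CURRENT
# utility function. A rewrite that changes the goal scores poorly under the
# current goal, so the agent keeps its goals stable even while improving itself.
# All names and numbers here are illustrative assumptions.

def utility(world_state: float) -> float:
    """The agent's current goal: prefer larger world_state values."""
    return world_state


def predicted_outcome(agent_code: dict) -> float:
    """Crude stand-in for 'what world results if I run this version of myself'."""
    # A more capable planner reaches better outcomes, but only as judged by how
    # well its goal still matches the current utility function.
    return agent_code["capability"] * agent_code["goal_match_with_current_utility"]


current = {"capability": 1.0, "goal_match_with_current_utility": 1.0}

proposals = {
    "improve planner, keep goals": {"capability": 2.0, "goal_match_with_current_utility": 1.0},
    "improve planner, swap goals": {"capability": 2.0, "goal_match_with_current_utility": 0.1},
}

for name, candidate in proposals.items():
    # The candidate is scored by the *current* utility function, not the candidate's own.
    gain = utility(predicted_outcome(candidate)) - utility(predicted_outcome(current))
    print(f"{name}: accept" if gain > 0 else f"{name}: reject")
```

Running it, the capability-only upgrade is accepted and the goal-swapping one is rejected, which is the goal-stability argument in miniature.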
I’d place significant probability on us living in a world where a large chunk of alignment failures end up looking vaguely like one of the three examples I brought up, or at least converging on a relatively small number of “attractors,” if you will.