I think a crux here is that I genuinely don’t think we’d inevitably destroy ourselves or create a permanent dystopia with ASI by default (assuming it’s controlled/aligned, which I think is pretty likely), but I do think it’s reasonably plausible, so the main thing I’m more or less objecting to is the certainty involved here, rather than its plausibility.
I don’t think it’s inevitable, but I do think it’s the expected outcome. I agree I’m more suspicious of humans than most people, but obviously I also think I’m right.
People wig out when they get power, even collectively. Trying to ride herd on an AxI is bound to generate stress, tax cognitive capacities, and possibly engender paranoia. Almost everybody seems to have something they’d do if they were King of the World that a substantial number of other people would see as dystopian. One of the strong tendencies seems to be the wish to universalize rightthink, and real mind control might become possible with plausible technology. Grand Visions, moral panics, and purity spirals often rise to pandemic levels, but are presently constrained by being impossible to fully act on. And once you have the Correct World Order on the Most Important Issue, there’s a massive impulse to protect it regardless of any collateral damage.
the alignment problem being noticeably easier to solve than 10 years ago
I’m really unconvinced of that. I think people are deceived by their ability to get first-order good behavior in relatively constrained circumstances. I’m definitely totally unconvinced that any of the products that are out there now are “aligned” with anything importantly useful, and they are definitely easy mode.
Also, that’s without annoying complications like having to expect the model to advise you on things you literally can’t comprehend. I can believe that you and an ASI might end up agreeing on something, but when the ASI can’t convey all the information you’d need to have a truly informed opinion, who’s aligned with whom? How is it supposed to avoid manipulating you, no matter whether it wants to, if it has to reduce a set of ideas that fundamentally won’t fit into your head into something you can give it an opinion on?
Mind you, I don’t know how to do “friendliness” any more than I know how to do “intent alignment”. But I know which one I’d pick.
[Oh, and on edit to be clear, what I was asking for with the original post was not so much to abandon human control as obviously unacceptable, no matter how suspicious I am of it personally. It was to stop treating any solution that didn’t involve human control as axiomatically unacceptable, without regard to other outcomes. If somebody does solve friendliness, use it, FFS, especially if that solution actually turns out to be more reliable than any available alternative human-control solution.]
It was to stop treating any solution that didn’t involve human control as axiomatically unacceptable, without regard to other outcomes.
The issue is that it’s unclear whether it’s acceptable, so it should be avoided if at all possible, pending more consideration. In principle there is more time for that than there is for any other concern that doesn’t involve the risk of losing control in a less voluntary way. The revealed preference looks the same as finding it unacceptable to give up the potential for human control, but the argument is different, so the long-term behavior implied by that argument is different. It might only take a million years to decide to give up control.
By this, are you not assuming that keeping humans in charge is extremely unlikely to result in a short-term catastrophe? You may not get a million years or even a hundred years.
By the way, I think the worst risk from human control isn’t extinction. The worse, and more likely, risk is some kind of narrow, fanatical value system being imposed universally, very possibly by direct mind control. I’d expect “safeguards” to be set up to make sure that the world won’t drift away from that system… not even in a million years. And the collateral damage from the safeguards would probably be worse than the limitations imposed by the base value system.
I would expect the mind control to apply more to the humans “in charge” than to the rest.
I’m not making any claims about feasibility; I only dispute the claim that it’s known that permanently giving up the potential for human control is an acceptable thing to do, or that making such a call (an epistemic call about what is known) is reasonable in the foreseeable future. To the extent that it’s possible to defer this call, it should therefore be deferred (this is a normative claim, not a plan or a prediction of feasibility). If it’s not possible to keep the potential for human control despite this uncertainty, then it’s not possible, but that won’t be because the uncertainty got resolved to the extent that it could be humanly resolved.
I don’t think it’s inevitable, but I do think it’s the expected outcome. I agree I’m more suspicious of humans than most people, but obviously I also think I’m right.
People wig out when they get power, even collectively. Trying to ride herd on an AxI is bound to generate stress, tax cognitive capacities, and possibly engender paranoia. Almost everybody seems to have something they’d do if they were King of the World that a substantial number of other people would see as dystopian. One of the strong tendencies seems to be the wish to universalize rightthink, and real mind control might become possible with plausible technology. Grand Visions, moral panics, and purity spirals often rise to pandemic levels, but are presently constrained by being impossible to fully act on. And once you have the Correct World Order on the Most Important Issue, there’s a massive impulse to protect it regardless of any collateral damage.
Agree with this (with the caveat that dystopian worlds are relative to your values).
I’m really unconvinced of that. I think people are deceived by their ability to get first-order good behavior in relatively constrained circumstances. I’m definitely totally unconvinced that any of the products that are out there now are “aligned” with anything importantly useful, and they are definitely easy mode.
I think a crux is that I consider the opposite problem, people searching for an essence and ignoring the behavioral aspects, to be more serious than people overgeneralizing from first-order good behavior in reasonably constrained circumstances, because it’s way too easy to assume that there must be a Platonic essence of a thing that is almost ineffable and inscrutable to empirical study.
More generally, a crux here is that I believe most of the alignment-relevant parts of an AI come from the data it was trained on, combined with my belief that the adversarial examples where human language doesn’t track reality are less important for alignment than a lot of people think, and thus that training on human data does implicitly align the models 50-70% of the way towards human values at minimum.
Also, that’s without annoying complications like having to expect the model to advise you on things you literally can’t comprehend. I can believe that you and an ASI might end up agreeing on something, but when the ASI can’t convey all the information you’d need to have a truly informed opinion, who’s aligned with whom? How is it supposed to avoid manipulating you, no matter whether it wants to, if it has to reduce a set of ideas that fundamentally won’t fit into your head into something you can give it an opinion on?
Yeah, this does mean you can’t have too strict a definition of manipulation, and it’s important to note that even aligned AI probably makes us pets over time (with the caveat that instruction following/corrigibility may extend this time immensely, and augmentation of certain humans may make them the ultimate controllers of the future in an abstract sense).
More generally, a crux here is that I believe most of the alignment-relevant parts of an AI come from the data it was trained on, combined with my belief that the adversarial examples where human language doesn’t track reality are less important for alignment than a lot of people think, and thus that training on human data does implicitly align the models 50-70% of the way towards human values at minimum.
I have trouble with the word “alignment”, although even I find myself slipping into that terminology occasionally now. What I really want is good behavior. And as you say, that’s good behavior by my values. Which I hope are closer to the values of the average person with influence over AI development than they are to the values of the global average human.
Since I don’t expect good behavior from humans, I don’t think it’s adequate to have AI that’s even 100 percent aligned, in terms of behaviorally revealed preferences, with humans-in-general as represented by the training data. A particular danger for AI is that it’s pretty common for humans, or even significant groups of humans, to get into weird corner cases and obsess over particular issues to the exclusion of things that other humans would think are more important… something that’s encouraged by targeted interventions like RLHF. Fanatically “aligned” AI could be pretty darned dystopian. But even “alignment” with the average person could result in disaster.
If you look at it in terms of stated preferences instead of revealed preferences, I think it gets even worse. Most of ethical philosophy looks to me like humans trying to come up with post hoc ways to make “logical necessities” out of values and behaviors (or “intuitions”) that they were going to prefer anyway. If you follow the implications of the resulting systems a little bit beyond wherever their inventors stopped thinking, they usually come into violent conflict with other intuitions that are often at least as important.
If you then add the caveat that it’s only 50 to 70 percent “aligned”… well, would you want to have to deal with a human that only agreed with you 50 to 70 percent of the time on what behavior was good? Especially on big issues? I think that, on most ways of “measuring” it, the vast majority of humans are probably much better than 50 to 70 percent “aligned” with one another… but humans still aren’t mutually aligned enough to avoid massive violent conflicts over stated values, let alone massive violent conflicts over object-level outcomes.
To the extent that I understand your position, it’s that sharing a lot of values doesn’t automatically imply that AI is safe/non-dystopian to your values if built, rather than saying that alignment is hard/impossible to someone’s values (note when I say that a model is aligned, I am always focused on aligning it to one person’s values).
I also dislike the terminology, and I actually agree that alignment is not equal to safety. This is probably one of the disagreements between a lot of LWers and myself: I don’t think alignment automatically makes things better (in fact, things can get worse by making alignment better).
For example, it does not rule out this scenario, where the species doesn’t literally go extinct, but lots of humans die because the economic incentives for not stealing/using violence fall apart as humans become effectively worthless on the market:
https://www.lesswrong.com/posts/2ujT9renJwdrcBqcE/the-benevolence-of-the-butcher
To the extent that I understand your position, it’s that sharing a lot of values doesn’t automatically imply that AI is safe/non-dystopian to your values if built, rather than saying that alignment is hard/impossible to someone’s values (note when I say that a model is aligned, I am always focused on aligning it to one person’s values).
Yes, with the caveat that I am not thereby saying that it’s not hard to align to even one person’s values.
Fair enough.
I admittedly have a lot of agreement with you, and that’s despite thinking we can make machines that do follow orders/are intent-aligned à la Seth Herd’s definition:
https://www.lesswrong.com/posts/7NvKrqoQgJkZJmcuD/instruction-following-agi-is-easier-and-more-likely-than