Thanks for the comment. I think people have different conceptions of what “value aligning” an AI means. Currently, I think the best “value alignment” plan is to guardrail AIs with an artificial conscience that approximates an ideal human conscience (the conscience of a good and wise human). Contained in our consciences are implicit values, such as those behind not stealing or killing except maybe in extreme circumstances.
A world in which “good” transformative AI agents have to autonomously go on the defensive against “bad” transformative AI agents seems pretty inevitable to me right now. I believe that when this happens, if we don’t have some sort of very workable conscience module in our “good” AIs, the collateral damage of these “clashes” is going to be much greater than it otherwise would be. Basically what I’m saying is yes, it would be nice if we didn’t need to get “value alignment” of AIs “right” under a tight timeline, but if we want to avoid some potentially huge bad effects in the world, I think we do.
To respond to some of your specific points:
I’m very unsure about how AIs will evolve, so I don’t know if their system of ethics/conscience will end up being locked in or not, but this is a risk. This is part of why I’d like to do extensive testing and iterating to get an artificial conscience system as close to “final” as possible before it’s loaded into an AI agent that’s let loose in the world. I’d hope that the system of conscience we’d go with would support corrigibility, so we could shut down the AI even if we couldn’t change its conscience/values.
I’m sure there will be plenty of unforeseen consequences (or “externalities”) arising from transformative AI, but if the conscience we load into AIs is good enough, it should allow them to handle situations we’ve never thought of in the way wise humans might—I don’t think wise humans need to update their system of conscience with each new situation; they just have to suss out the situation to see how their conscience should apply to it.
I don’t know if there are moral facts, but something that seems to me to be on the level of a fact is that everyone cares about their own well-being—everyone wants to feel good in some way. Some people are very confused about how to go about doing this and engage in self-destructive acts, but ultimately they’re trying to feel good (or less bad) in some way. And most people have empathy, so they feel good when they think others feel good. I think this is the entire basis from which we should start for a universal, not-ever-gonna-change human value: we all want to feel good in some way. Then it’s just a question of understanding the “physics” of how we work and what makes us feel the most overall good (well-being) over the long term. And I put forward the hypothesis that raising self-esteem is the best heuristic for raising overall well-being, and further, that increasing our responsibility level is the path to higher self-esteem (see Branden for the conception of “self-esteem” I’m talking about here).
I also consider AIs replacing all humans to be an extremely bad outcome. I think it’s a result that someone with an “ideal” human conscience would actively avoid bringing about, and thus an AI with an artificial conscience based on an ideal human conscience (emphasizing responsibility) should do the same.
Ultimately, there’s a lot of uncertainty about the future, and I wouldn’t write off “value alignment” in the form of an artificial conscience just yet, even if there are risks involved with it.
Thanks for your reply. I think we should use the term artificial conscience, not value alignment, for what you’re trying to do, for clarity. I’m happy to see we seem to agree that reversibility is important and that replacing humans is an extremely bad outcome. (I’ve talked to people who are into value alignment of ASI who said they “would bite that bullet”, in other words, would replace humanity with more efficient, happy AI consciousness, so this point does not seem to be obvious. I’m also not convinced that leading longtermists necessarily think replacing humans is a bad outcome, and I think we should call them out on it.)
If one can implement artificial conscience in a reversible way, it might be an interesting approach. I think a minimum of what an aligned ASI would need to do is block other unaligned ASIs or ASI projects. If humanity supports this, I’d file it under a positive offense-defense balance, which would be great. If humanity doesn’t support it, doing it anyway would lead to conflict with humanity. I think an artificial conscience AI would either not want to fight that conflict (making it unable to stop unaligned ASI projects), or, if it did, people would not see it as good anymore. I think societal awareness of x-risk, and from there support for regulation (whether enforced by AI or not), is what should make our future good, rather than aligning an ASI in a certain way.
Yes, I think referring to it as “guard-railing with an artificial conscience” would be clearer than saying “value aligning,” thank you.
I believe that if there were no beings around who had real consciences (with consciousness and the ability to feel pain as two necessary prerequisites to conscience), then there’d be no value in the world. No one to understand and measure or assign value means no value. And any being that doesn’t feel pain can’t understand value (nor feel real love, by the way). So if we ended up with some advanced AIs replacing humans, then we made some sort of mistake. We most likely either got the artificial conscience wrong, since a correct one would’ve implicitly valued human life and so wouldn’t have let a guard-railed AI wipe out humans, or we didn’t get an artificial conscience on board enough AIs in time. An AI that had a “real” conscience also wouldn’t wipe out humans against the will of humans.
The way I currently envision the “typical” artificial conscience is that it would put a pretty strong conscience weight on doing what its user wanted it to do, but this could be overruled by the even stronger conscience weight on preventing catastrophes. So the defensive, artificial conscience-guard-railed AI I’m thinking of would do the “last resort” things that were necessary to prevent s-risks, x-risks, and major catastrophes from coming to fruition, even if this wasn’t popular with most people, at least up to a point. If literally everyone in the world said, “Hey, we all want to die,” then the guard-railed AI, if it thought the people were in their “right mind,” would respect their wishes and let them die.
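Just to illustrate the kind of weighting I mean, here’s a toy sketch in Python (all the weights, action names, and cost labels are made up for illustration; this is not an actual design):

```python
# Purely illustrative sketch of "conscience weights": each candidate action
# accrues a penalty for the moral costs it would incur, and the action with
# the lowest total penalty is chosen. All numbers and names are hypothetical.

# Penalty for each kind of moral cost (larger = weighs more heavily on the
# conscience). Failing to prevent a catastrophe is weighted far more heavily
# than disobeying the user, which is what lets it "overrule" the latter.
CONSCIENCE_WEIGHTS = {
    "disobey_user": 10.0,
    "fail_to_prevent_catastrophe": 1000.0,
}

def conscience_penalty(moral_costs: list[str]) -> float:
    """Total conscience penalty for an action, given the moral costs it incurs."""
    return sum(CONSCIENCE_WEIGHTS[c] for c in moral_costs)

def choose_action(candidates: dict[str, list[str]]) -> str:
    """Pick the candidate action whose total conscience penalty is lowest."""
    return min(candidates, key=lambda a: conscience_penalty(candidates[a]))

# Example: the user asks the AI to stand aside, but standing aside would let
# a catastrophe happen; intervening disobeys the user.
candidates = {
    "follow_user_request": ["fail_to_prevent_catastrophe"],
    "intervene_to_prevent_catastrophe": ["disobey_user"],
}

print(choose_action(candidates))  # -> "intervene_to_prevent_catastrophe"
```

The only point of the toy example is that the weight against failing to prevent a catastrophe is set high enough to overrule the weight against disobeying the user; a real artificial conscience would obviously be far more complicated than a lookup table.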
All that said, if we could somehow pause development of autonomous AIs everywhere around the world until humans got their act together, developing their own consciences and senses of ethics, and were working as one team to cautiously take the next steps forward with AI, that would be great.
So if we ended up with some advanced AIs replacing humans, then we made some sort of mistake
Again, I’m glad that we agree on this. I notice you want to do what I consider the right thing, and I appreciate that.
The way I currently envision the “typical” artificial conscience is that it would put a pretty strong conscience weight on doing what its user wanted it to do, but this could be overruled by the even stronger conscience weight on preventing catastrophes. So the defensive, artificial conscience-guard-railed AI I’m thinking of would do the “last resort” things that were necessary to prevent s-risks, x-risks, and major catastrophes from coming to fruition, even if this wasn’t popular with most people, at least up to a point.
I can see the following scenario occurring: the AI, with its artificial conscience, rightly decides that a pivotal act needs to be undertaken to avoid x-risk (or s-risk). However, the public mostly doesn’t recognize the existence of such risks. The AI proceeds to sabotage people’s unsafe AI projects against the public’s will. What happens now is: the public gets absolutely livid at the AI, which is subverting human power by acting against human will. Almost all humans team up to try to shut down the AI. The AI recognizes (and had already recognized) that if it loses, humans risk going extinct, so it fights this war against humanity and wins. I think in this scenario, an AI, even one with an artificial conscience, could become the most hated thing on the planet.
I think people underestimate the amount of pushback we’re going to get once you get into pivotal act territory. That’s why I think it’s hugely preferable to go the democratic route and not count on AI taking unilateral actions, even if it would be smarter or even wiser, whatever that might mean exactly.
All that said, if we could somehow pause development of autonomous AIs everywhere around the world until humans got their act together, developing their own consciences and senses of ethics, and were working as one team to cautiously take the next steps forward with AI, that would be great.
So yes, I definitely agree with this. I don’t think lack of conscience or ethics is the issue, though, but rather lack of existential risk awareness.
In terms of doing a pivotal act (which is usually thought of as preemptive, I believe) or just whatever defensive acts were necessary to prevent catastrophe, I hope the AI would be advanced enough to make decent predictions of what the consequences of its actions could be in terms of losing “political capital,” etc., and then make its decisions strategically. Personally, if I had the opportunity to save the world from nuclear war, but everyone was going to hate me for it, I’d do it. But then, losing the ability to affect anything afterward wouldn’t matter for me the way it would for a guard-railed AI, which could do a huge amount of good afterward if it weren’t shunned by society. Improving humans’ consciences and ethics would hopefully help keep them from hating the AI that saved them.
Also, if there were enough people, especially in power, who had strong consciences and senses of ethics, then maybe we’d be able to shift the political landscape from its current state of countries seemingly having different values and not trusting each other, to a world in which enforceable international agreements could be much more readily achieved.
I’m happy for people to work on increasing public awareness and trying for legislative “solutions,” but I think we should be working on artificial conscience at the same time—when there’s so much uncertainty about the future, it’s best to bet on a whole range of approaches, distributing your bets according to how likely you think different paths are to succeed. I think people are underestimating the artificial conscience path right now, that’s all.
Thanks for all your comments!