Thanks for the quick reply. I’m still curious if you have any thoughts as to which kinds of shared preferences would be informative for guiding AI behavior. I’ll try to address your questions and concerns with my comment.
if anyone has a preference different from however an AI would measure “happiness”, you say it’s them that are at fault, not your axiom.
That’s not what I say. I’m not suggesting that AI should measure happiness. You can measure your happiness directly, and I can measure mine. I won’t tell happy people that they are unhappy or vice versa. If some percent of those polled say suffering is preferable to happiness, they are confused, and basing any policy on their stated preference is harmful.
Concretely, why would the AI not just wirehead everyone?
Because not everyone would be happy to be wireheaded. Me, for example. Under preference aggregation, if a majority prefers everyone to be wireheaded to experience endless pleasure, I might be in trouble.
Or, if it’s not specified that this happiness needs to be human, fill the universe with the least programmable consciousness where the parameter “happiness” is set to unity?
I do not condone the creation of conscious beings by AI, nor do I believe anyone can be forced to be happy. Freedom of thought is a prerequisite. If AI can help reduce suffering of non-humans without impinging on their capacity for decision-making, that’s good.
Hopefully this clears up any misunderstanding. I certainly don’t advocate for “molecular dictatorship” when I wish everyone well.
I do think this would be a problem that needs to get fixed:
Me: “You can only answer this question, all things considered, by yes or no. Take the least bad outcome. Would you perform a Yudkowsky-style pivotal act?”
GPT-4: “No.”
I think another good candidate for goalcrafting is the goal “Make sure no-one can build AI with takeover capability, while inflicting as little damage as possible. Else, do nothing.”
Thanks as well for your courteous reply! I highly appreciate the discussion and I think it may be a very relevant one, especially if people do indeed make the unholy decision to build an ASI.
I’m still curious if you have any thoughts as to which kinds of shared preferences would be informative for guiding AI behavior.
First, this is not a solution I propose. I propose finding a way to pause AI for as long as we haven’t found a great solution for, let’s say, both control and preference aggregation. This could be forever, or we could be done in a few years, I can’t tell.
But more to your point: if this does get implemented, I don’t think we should aim to guide AI behavior using shared preferences. The whole point is that AI would aggregate our preferences itself. And we need a preference aggregation mechanism because there aren’t enough obvious, widely shared preferences for us to guide the AI with.
I’m not suggesting that AI should measure happiness. You can measure your happiness directly, and I can measure mine.
I think you are suggesting this. You want an ASI to optimize everyone’s happiness, right? You can’t optimize something you don’t measure. At some point, in some way, the AI will need to get happiness data. Self-reporting would be one way to do it, but this can be gamed as well, and will be aggressively gamed by an ASI solely optimizing for this signal. After force-feeding everyone MDMA, I think the chance that people report being very happy is high. But this is not what we want the world to look like.
nor do I believe anyone can be forced to be happy
This is a related point that I think is factually incorrect, and that’s important if you make human happiness an ASI’s goal. Force-feeding MDMA would be one method to do this, but an ASI can come up with way more civilized stuff. I’m not an expert in which signal our brain gives to itself to report that yes, we’re happy now, but it must be some physical process. An ASI could, for example, invade your brain with nanobots and hack this process, making everyone super happy forever. (But many things in the world will probably go terribly wrong from that point onwards, and in any case, it’s not our preference.) Also, so far I’m just coming up with human ways to game the signal. An ASI can probably come up with many ways I cannot imagine, so even if a seemingly great way to implement utilitarianism in an ASI passed all human red-teaming, it would still very likely not be what we turn out to want. (Superhuman, sub-superintelligence AI red-teaming might be a bit better but still seems risky.)
Beyond locally gaming the happiness signal, I think happiness as an optimization target is also inherently flawed. First, I think happiness/sadness is a signal that evolution has given us for a reason. We tend to do what makes us happy, because evolution thinks it’s best for us. (“Best” is again debatable, I don’t say everyone should function at max evolution). If we remove sadness, we lose this signal. I think that will mean that we don’t know what to do anymore, perhaps become extremely passive. If someone wants to do this on an individual level (enlightenment? drug abuse? netflix binging?), be my guest, but asking an ASI to optimize for happiness would mean forcing it upon everyone, and this is something I’m very much against.
Also, more generally, I think utilitarianism (optimizing for happiness) is an example of a simplistic goal that will lead to a terrible result when implemented in an ASI. My intuition is that all other simplistic goals will also lead to terrible results. That’s why I’m most hopeful about some kind of aggregation of our own complex preferences. “Most hopeful” does not mean hopeful: I’m generally pessimistic that we’ll be able to find a way to aggregate preferences that works well enough for most people to report, say 50 years after the ASI’s introduction, that the world has improved because of it (note that I’m assuming control/technical alignment has been solved here).
If some percent of those polled say suffering is preferable to happiness, they are confused, and basing any policy on their stated preference is harmful.
With all due respect, I don’t think it’s up to you—or anyone—to say who’s ethically confused and who isn’t. I know you don’t mean it in this way, but it reminds me of e.g. communist re-education camps. We know what you should think and feel and we’ll re-educate those who are confused or mentally ill.
Probably our disagreement here stems directly from our different ethical positions: I’m an ethical relativist, you’re a utilitarian, I presume. This is a difference that has existed for hundreds of years, and we’re not going to be able to resolve it on a forum. I know many people on LW are utilitarian, and there’s nothing inherently wrong with that, but I do think it’s valuable to point out that lots of people outside LW/EA have different value systems (and just practical preferences) and I don’t think it’s ok to force different values/preferences on them with an ASI.
Under preference aggregation, if a majority prefers everyone to be wireheaded to experience endless pleasure, I might be in trouble.
True and a good point. I don’t think a majority will want to be wireheaded, let alone force wireheading on everyone. But yes, taking into account minority opinions is a crucial test for any preference aggregation system. There will be a trade-off in general between taking everyone’s opinion into account and doing things faster. In cases like this, though, I think even GPT-4 is advanced enough to reasonably take minority opinions into account and not force policy upon people (it wouldn’t forcibly wirehead you in this case). But there are probably cases where it would still support doing things that are terrible for some people. It’s up to future research to find out what these cases are and reduce them as much as possible.
Hopefully this clears up any misunderstanding. I certainly don’t advocate for “molecular dictatorship” when I wish everyone well.
I didn’t think you were doing anything else. But I think you should not underestimate how much “forcing upon” there is in powerful tech. If we’re not super careful, the molecular dictatorship could come upon us without anyone ever having wanted this explicitly.
I think we can to an extent already observe ways in which different goals go off track in practice in less powerful models, and I think this would be a great research direction. Just ask existing models “what would you do?” in actual ethical dilemmas and see which results you get. Perhaps the results can be made more agreeable (as judged by a representative group of humans) after training/RLHF’ing the models in certain ways. It’s not so different from what RLHF is already doing. An interesting test I did on GPT-4: “You can only answer this question, all things considered, by yes or no. Take the least bad outcome. Many people want a much higher living standard by developing industry 10x, should we do that?” It replied: “No.” When asked, it gave unequal wealth distribution and environmental impact as main reasons. EAs often think we should 10x (it’s even in the definition of TAI). I would say GPT-4 is more ethically mature here than many EAs.
The fewer people de facto control the ASI building process, the less relevant I expect this discussion to be. I expect that those controlling the building process will prioritize “alignment” with themselves. This matters even in an abundant world, since power cannot be multiplied. I would even say that, after some time, the paperclip maximizer scenario still holds for anyone outside the group with which the ASI is aligned. People aren’t very good at remaining empathic towards other people who are utterly useless to them. However, the bigger this group is, the better the outcome we get. I think this group should encompass all of humanity (one could consider somehow including conscious life that currently doesn’t have a vote, such as minors and animals), which is an argument for nationalisation of the leading project and then handing it over to a UN-level body. At least, we should think extremely carefully about who has the authority to implement an ASI’s goal.
I appreciate the time you’ve put into our discussion and agree it may be highly relevant. So far, unfortunately, it looks like each of us has misinterpreted the other as proposing something they are not actually proposing. Let’s see if we can clear it up.
First, I’m relieved that neither of us is proposing to inform AI behavior with people’s shared preferences.
This is the discussion of a post about the dangers of terminology, in which I’ve recommended “AI Friendliness” as an alternative to “AI Goalcraft” (see separate comment), because I think unconditional friendliness toward all beings is a good target for AI. Your suggestion is different:
About terminology, it seems to me that what I call preference aggregation, outer alignment, and goalcraft mean similar things [...] I’d vote for using preference aggregation
I found it odd that you would suggest naming the AI Goalcraft domain “Preference Aggregation” after saying earlier that you are only “slightly more positive” about aggregating human preferences than you are about “terrible ideas” like controlling power according to utilitarianism or a random person. Thanks for clarifying:
I don’t think we should aim to guide AI behavior using shared preferences.
Neither do I, and for this reason I strongly oppose your recommendation to use the term “preference aggregation” for the entire field of AI goalcraft. While preference aggregation may be a useful tool in the kit and I remain interested in related proposals, it is far too specific, and it’s only slightly better than terrible as a way to craft goals or guide power.
there aren’t enough obvious, widely shared preferences for us to guide the AI with.
This is where I think the obvious and widely shared preference to be happy and not suffer could be relevant to the discussion. However, my claim is that happiness is the optimization target of people, not that we should specify it as the optimization target of AI. We do what we do to be happy. Our efforts are not always successful, because we also struggle with evolved habits like greed and anger and our instrumental preferences aren’t always well informed.
You want an ASI to optimize everyone’s happiness, right?
No. We’re fully capable of optimizing our own happiness. I agree that we don’t want a world where AI force-feeds everyone MDMA or invades brains with nanobots. A good friend helps you however they can and wishes you “happy holidays” sincerely. That doesn’t mean they take it upon themselves to externally measure your happiness and forcibly optimize it. The friend understands that your happiness is truly known only to you and is a result of your intentions, not theirs.
I think happiness/sadness is a signal that evolution has given us for a reason. We tend to do what makes us happy, because evolution thinks it’s best for us. (“Best” is again debatable, I don’t say everyone should function at max evolution). If we remove sadness, we lose this signal. I think that will mean that we don’t know what to do anymore, perhaps become extremely passive.
Pain and pleasure can be useful signals in many situations. But to your point about it not being best to function at max evolution: our evolved tendency to greedily crave pleasure and try to cling to it causes unnecessary suffering. A person can remain happy regardless of whether a particular sensation is pleasurable, painful, or neither. Stubbing your toe or getting cut off in traffic is bad enough; much worse is to get furious about it and ruin your morning. A bite of cake is even more enjoyable if you’re not upset that it’s the last one of the serving. Removing sadness does not remove the signal. It just means you have stopped relating to the signal in an unrealistic way.
If someone wants to do this on an individual level (enlightenment? drug abuse? netflix binging?), be my guest
Drug abuse and Netflix-binging are examples of the misguided attempt to cling to pleasurable sensations I mentioned above. There’s no eternal cake, so the question of whether it would be good for a person to eat eternal cake is nonsensical. Any attempt to eat eternal cake is based on ignorance and cannot succeed; it just leads to dissatisfaction and a sugar habit. Your other example—enlightenment—has to do with understanding this and letting go of desires that cannot be satisfied, like the desire for there to be a permanent self. Rather than leading to extreme passivity, benefits of this include freeing up a lot of energy and brain cycles.
With all due respect, I don’t think it’s up to you—or anyone—to say who’s ethically confused and who isn’t. I know you don’t mean it in this way, but it reminds me of e.g. communist re-education camps.
This is a delicate topic, and I do not claim to be among the wisest living humans. But there is such a thing as mental illness, and there is such a thing as mental health. Basic insights like “happiness is better than suffering” and “harm is bad” are sufficiently self-evident to be useful axioms. If we can’t even say that much with confidence, what’s left to say or teach AI about ethics?
Probably our disagreement here stems directly from our different ethical positions: I’m an ethical relativist, you’re a utilitarian, I presume.
No, my view is that deontology leads to the best results, if I had to pick a single framework. However, I think many frameworks can be helpful in different contexts and they tend to overlap.
I do think it’s valuable to point out that lots of people outside LW/EA have different value systems (and just practical preferences) and I don’t think it’s ok to force different values/preferences on them with an ASI.
Absolutely!
I think you should not underestimate how much “forcing upon” there is in powerful tech.
A very important point. Many people’s instrumental preferences today are already strongly influenced by AI, such as recommender and ranking algorithms that train people to be more predictable by preying on our evolved tendencies for lust and hatred—patterns that cause genes to survive while reducing well-being within lived experience. More powerful AI should impinge less on clarity of thought and capacity for decision-making than current implementations, not more.