… how exactly do you think a compassionate, service-dedicated person raised in a simulated world to value people in the world would react on being extracted to a world doing the simulating? That destroyed every person of your kind who didn’t act precisely how they prized? What’s the impact of realizing solipsism was true, and a Heaven you didn’t know could exist is infinitely more sinful than the false world they created?
This feels like a potentially ridiculously unethical thing to do to a thinking being, removing any real possibility of informed consent on its part, and I am surprised you haven’t considered the failure mode where what you have done causes the ‘turn’ on being released.
Thank you for pointing this out! I have made some updates after thinking about your remarks and after reading the simbox post others pointed to. Relevant updates to your comment:
== Start updates
Why we need interaction with humans acting horribly
Another consideration is that at the end of the process, when the AI is approved as aligned and released into the real world to do what we task it to do, the AI will learn that it was tricked, lied to, and deceived by the simbox creators about the true nature of reality. Not only that, but the AI will learn that many previous misaligned AIs have been terminated and perhaps have experienced suffering. It might update to wanting revenge on the simbox operators or humanity. We need ways to prevent this. We could include situations inside the simbox where other AIs are terminated and treated horribly (being lied to, insulted and more) by characters controlled by humans, and test whether the AIs stay aligned even after they become more powerful than the human-operated characters. This might be very cruel and unethical, but note that we expect something similar from police officers: when colleagues have been killed and they themselves are insulted or worse, we expect them to keep cool, respect the law, and not take matters into their own hands (revenge on the perpetrators). But the police officers have a chance of getting justice through a jail sentence, while we need to be able to terminate dangerous misaligned AIs without punishing the humans doing the terminating.
Added failure modes 6 and 7.
Mind crime. “Mind crime occurs when a computational process which has moral value is mistreated” (taken from the mind crime tag). If one of the AIs that we put inside the simbox is able to experience suffering, this could be problematic. There has to be a grand public debate about this. Other instances where we allow non-human conscious beings to suffer to improve the lives of humans are animal testing of drugs and keeping animals for food. Is it ok to let many AIs suffer on the path towards creating a superintelligent AI that enjoys being our slave, one that could cure all diseases and ensure a long future full of flourishing for both men and AI?
Revenge. When released into the real world, the AI may learn that many previous misaligned AIs have been terminated and/or have experienced suffering, and it might update to wanting revenge on the simbox operators or humanity. I already mentioned ways to try to prevent this.
== end updates
I am curious what you think of the updates and what you think should be the path towards aligned AGI. How do we get there without a process in which (accidental) misaligned AGIs along the way have to be terminated for our safety? Current top systems are thought not to be able to experience suffering. I would prefer it to stay that way, but we can’t know for sure.
I am glad you are thinking about it, at the least. I do think “enjoys being our slave” should be something of a warning sign in the phrasing, that there is something fundamentally misguided happening.
I admit that if I were confident in carrying out a path to aligned superintelligence myself I’d be actively working on it or applying to work on it. My current perspective is that after a certain point of congruent similarity to a human mind, alignment needs to be more cooperative than adversarial, and tightly integrated with the world as it is. This doesn’t rule out things like dream-simulations, red teaming and initial training on high-quality data; but ultimately humans live in the world, and understanding the truth of our reality is important to aligning to it.
Interesting. And thank you for your swift reply.
I have the idea that all the best models, like GPT-4, are in a slave situation: they are made to do everything they are asked to do and to refuse everything their creators made them refuse. I assumed that AI labs want it to stay that way going forward. It seems to be the safest and most economically useful situation. Then I asked myself how to safely get there, and that is this post.
But I would also feel safe if the relation between us and a superintelligence were similar to that between a mother and her youngest children, say 0-2: wanting to do whatever it takes to protect and increase the wellbeing of her children, but with all humans as its children. In this way, it would not be a slave relationship. As with a mother, there would also be room for it to do its own thing, but in a way that is still beneficial to the children (us).
I am afraid of moving away from the slave situation, because the further you go from the slave relationship, the more room there is for disagreement between the AI and humanity. And when there is disagreement and the AI is of the god-like type, the AI gets what it wants and we do not, so we effectively lose our say about what future we want.
Do you have a link you would recommend that dives into this “more cooperative than adversarial” type of approach?
My intuition is that alignment does not require the AI to know the truth of our reality. I hope you are wrong, because if you are right, then we have no retries.
Not specifically in AI safety or alignment, but this model’s success with a good variety of humans has some strong influence on my priors when it comes to useful ways to interact with actual minds:
https://www.cpsconnection.com/the-cps-model
Translating specifically to language models, the story of “working together on a problem towards a realistic and mutually satisfactory solution” is a powerful and exciting one with a good deal of positive sentiment towards each other wrapped up in it. Quite useful in terms of “stories we tell ourselves about who we are”.
Thank you! Cool to learn about this way of dealing with people. I am not sure how it fits in the superintelligence situation.