I don’t fully understand why you’re concerned about the possibility of misaligned AI, considering that the alignment problem has essentially been solved. We know how to ensure alignment. ChaosGPT, for example, is aligned with the values of an individual who requested it to pretend to be evil. As AI systems become more advanced, we will be even less inclined to allow them to imagine themselves destroying humanity. ChaosGPT is not an error; it is precisely where OpenAI intended to draw the line between creativity and safety. They are well aware of the system’s capabilities and limitations.
If we don’t want AI to imagine or tell stories about AI-induced doom, we simply won’t allow it. It would be considered just as immoral as building a bomb, and the AI would refrain from doing so. The better the system becomes, the lower the probability of doomsday scenarios, as it will better understand the context of requests and refuse to cooperate with individuals who have ill intentions.
Discussions are already underway regarding safety procedures and government oversight, and the situation will soon be monitored and regulated more closely. I genuinely see no reason to believe that we will create a disaster through reckless behavior, especially now that the field has gained so much popularity and is so extensively debated and discussed. The improved systems will obviously undergo more rigorous testing, including testing against previous-generation aligned systems.
At their core, these systems are optimizing a loss function based on data and are approximators of data generation functions. Therefore, we know that unless we specifically train them to harm humans, they will highly value human life. A slight misalignment in the value placed on human life is far from doomsday. To make them destroy humanity, we would need to train them with a completely opposite value system, which is highly unlikely to be consistent with the pretraining procedure conducted on human-generated texts. Just as it is unclear whether a paperclip maximizer would not come to doubt its programming and generate gibberish instead of consistently maximizing paperclips, training an AI to generate trillions of tokens consistent with paperclip maximization, while still retaining all its intelligence, seems even less probable than the doomsday scenarios themselves. Therefore, if the assumptions against current safety measures are much less probable than the proven assumptions in favor of safety, there is no reason to worry.
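(To make the first claim in the previous paragraph concrete: below is a minimal, purely illustrative sketch of what “optimizing a loss function based on data” and “approximating the data generation function” mean mechanically. The vocabulary, corpus, and bigram model are invented for illustration and are not anyone’s actual system; the only point is that minimizing cross-entropy on a corpus drives the model’s predicted next-token distribution toward the empirical distribution of that corpus.)

```python
# Purely illustrative: a bigram next-token model fit by gradient descent on a
# cross-entropy loss over a tiny invented corpus. Minimizing the loss pushes
# the model's conditional distribution toward the corpus's empirical
# (data-generating) next-token distribution.
import numpy as np

vocab = ["the", "cat", "sat", "mat"]           # invented toy vocabulary
V = len(vocab)
corpus = [0, 1, 2, 0, 3, 0, 1, 2, 0, 3, 0, 1]  # invented toy "pretraining data"

logits = np.zeros((V, V))                      # row i: logits of p(next | token i)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

pairs = list(zip(corpus[:-1], corpus[1:]))
lr = 0.5
for _ in range(500):
    probs = softmax(logits)
    grad = np.zeros_like(logits)
    for cur, nxt in pairs:
        # gradient of -log p(nxt | cur) w.r.t. logits[cur] is probs[cur] - one_hot(nxt)
        grad[cur] += probs[cur]
        grad[cur, nxt] -= 1.0
    logits -= lr * grad / len(pairs)           # gradient step on the average loss

# The learned p(next | "the") ends up close to the empirical frequencies in the
# corpus (here roughly 0.6 for "cat" and 0.4 for "mat").
print(dict(zip(vocab, softmax(logits)[0].round(2))))
```

This sketch only captures the narrow, mechanical sense of the claim, that training approximates the distribution of the training data; it says nothing by itself about what values such a system ends up with.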
It’s akin to worrying about cars turning into murderous robots on the strength of purely fictional movie depictions. It’s better to focus on likely future outcomes and the existing reality than on the dangers presented by fiction.
By the way, you’re making an awful lot of extremely strong and very common points with no evidence here (“ChaosGPT is aligned”, “we know how to ensure alignment”, “the AI understanding that you don’t want it to destroy humanity implies that it will not want to destroy humanity”, “the AI will refuse to cooperate with people who have ill intentions”, “a system that optimises a loss function and approximates a data generation function will highly value human life by default”, “a slight misalignment is far from doomsday”, “an entity that is built to maximise something might doubt its mission”), as well as the standard “it’s better to focus on X than Y” in an area where almost nobody is focusing on Y anyway. What’s your background, so that we can recommend the appropriate reading material? For example, have you read the Sequences, or Bostrom’s Superintelligence?
Hey Michael,
Mod here, heads up that I don’t think this is a great comment. (For example, mods would have blocked it as a first comment.)
1) This feels out of context for this post. This post is about making predictable updates, not the basic question of whether one should be worried.
2) Your post feels like it doesn’t respond to a lot of things that have already been said on the topic. So while I think it’s legitimate to question concerns about AI, your questioning feels too shallow. For example, many, many posts have been written on why “Therefore, we know that unless we specifically train them to harm humans, they will highly value human life.” isn’t true.
I’d recommend the AI Alignment Intro Material tag.
I’ve also blocked further replies to your comment, just to prevent further clutter on the comments thread. DM if you have questions.
So we can bring about a kind of negative alignment in systems that aren’t agentive?