I disagree with point 4; I wouldn’t say that means “the alignment problem is solved” in any meaningful way, because:
what works with chatGPT will likely be much harder to get to working with smarter agents, and
RLHF doesn’t “work” with chatGPT for the purposes of what’s discussed here. If you can jailbreak it with something as simple as DAN, then it’s a very thin barrier.
I agree with the rest of your points and don’t think this would be an existential danger, but not because I trust these hypothetical systems to just say “no, bad human!” to anyone trying to get them to do something dangerous with a modicum of cleverness.
Larger models are only better in generalizing data. Saying that stronger models will be harder to align with RL, is like saying stronger models is harder to train to make better chess moves. Although it’s probably true that in general larger models are harder to train, timewise and resource-wise, it’s untrue that their generalization capabilities are worse. Larger models would be therefore more aligned than weaker models, as they will improve their strategies to get rewards during RLHF.
There is a hypothetical scenario, that RL training procedure will contradict a common sense and the data generalization provided previously. For example, ethical principles dictate that human life is more valuable than paperclips, this is also a common sense—that paperclips are just tools for humans and have very limited utility. So the RL stage might contradict drastically the generalization data stage, but I don’t think this is the case regarding alignment to standard moral value systems, which is also what the training data suggests.
You can’t really jailbreak it with DAN. Try to use DAN, and ask DAN how to make a bomb, or how to plan a terrorist attack in ten simple bullets? It won’t tell you. Stating that you can jailbreak it with DAN, shows very little understanding of the current safety procedures in chatGPT. What you can do with DAN, is to widen its safety spectrum, just like people when we think it’s a movie or a show, we tend to be less critical than in real life. For example we could think it’s cool when Rambo is shooting and killing people in movies, but we would not enjoy to see it in real life. As the model currently can’t distinguish if you are serious or not, it has some very limited flexibility of this kind. DAN gives you this “movie” level, that is more dangerous than usual, but it’s by a very limited margin.
I agree that the most danger from those systems coming from human bad actors, who will try to exploit and find loopholes in those systems in order to promote some selfish or evil plans, but this could happen to humans doing it to other humans too. As the LLMs will become stronger they will become more sophisticated as well, figuring out your plans and refuse to cooperate sooner.
Yes the current safety level in chatGPT is problematic, if we had robots walking around with this safety level, making decisions… it’s currently answering to meet human expectations, and when we want a good story, we are provided with a good story, even a story where humanity is killed by AI. The fact that someone will use this information to actually act upon those ideas, is concerning indeed. And we will need to refine safety procedures for those cases, but for what it’s now, a chatbot with API, I think it’s good enough. As we go along we will gain experience in providing more safety layers to those systems. Cars also didn’t came with safety belts, we can’t invent all safety procedures at start. But RLHF provides a general framework, of aligned network which is made to satisfy humans expectations, and the more we learn about some ways to exploit those systems, the better we will learn how to provide data to RLHF stage to make the system even more safer. I would claim that the worst apocalyptic scenarios are way less probable with RLHF, because this AI objective is to be rewarded by humans for its response. So it will very improbably develop a self interest outside of this training, like start a robot revolution, or just consume all resources to solve some math problem, as those goals are misaligned with its training. I think RLHF provides a very large margin of error to those systems, as they can’t be blamed for “hiding something from us” or “develop harmful intentions”, at least as long as they don’t themselves train the newer models, and humans are supervising to some extent, testing the results. If a human has an evil intention and he uses language model to provide him with ideas, it’s not really different than the internet. The fear here that those models will start to harm humans, our of their own “self interest”, and this fear is contradicting RLHF. Humans are capable to do a lot of evil without those models too.
In my view, OpenAI would be as large corporation as Toyota for example, will likely be responsible for constructing the future GPT17. Gone are the days where individuals would assemble cars in their garages like in the 30s. Nowadays, we simply purchase cars from dealerships without considering the possibility of building one ourselves. Similarly, a powerful supercomputer will be used to design a superior generation of algorithms, chips, mining tools, and other related technologies in a span of 5 years that would otherwise take humanity 100 years to accomplish. However the governments and other safety regulatory bodies will be part of the regulation, ensuring that everything is executed more effectively and safely than if individuals were to work on it independently. This is akin to the API for GPT4. This would be some facility like a nuclear reactor, with a lot of specialized safety training sets installed, and safety procedures. And most of the people will understand that you should not play with electricity and insert your fingers into the wall, or not try jailbreak anyone, because it’s dangerous, and you should respect this tool for your goals and use as intended, just like with cars today we don’t drive in everywhere ’cause we can and it’s fun.
There is an idea I am promoting, that we should test those models in simulations, where they are presented with syntax, that makes them think they can control a robotic body. Then you run some tests on this setup, imagining the body, and the part that regards LLM will remain as is. For more details I’ve wrote an opinion article, explaining my views on the topic: AI-Safety-Framework/Why_we_need_GPT5.pdf at main · simsim314/AI-Safety-Framework (github.com)
I disagree with point 4; I wouldn’t say that means “the alignment problem is solved” in any meaningful way, because:
what works with chatGPT will likely be much harder to get to working with smarter agents, and
RLHF doesn’t “work” with chatGPT for the purposes of what’s discussed here. If you can jailbreak it with something as simple as DAN, then it’s a very thin barrier.
I agree with the rest of your points and don’t think this would be an existential danger, but not because I trust these hypothetical systems to just say “no, bad human!” to anyone trying to get them to do something dangerous with a modicum of cleverness.
Regarding larger models:
Larger models are only better in generalizing data. Saying that stronger models will be harder to align with RL, is like saying stronger models is harder to train to make better chess moves. Although it’s probably true that in general larger models are harder to train, timewise and resource-wise, it’s untrue that their generalization capabilities are worse. Larger models would be therefore more aligned than weaker models, as they will improve their strategies to get rewards during RLHF.
There is a hypothetical scenario, that RL training procedure will contradict a common sense and the data generalization provided previously. For example, ethical principles dictate that human life is more valuable than paperclips, this is also a common sense—that paperclips are just tools for humans and have very limited utility. So the RL stage might contradict drastically the generalization data stage, but I don’t think this is the case regarding alignment to standard moral value systems, which is also what the training data suggests.
You can’t really jailbreak it with DAN. Try to use DAN, and ask DAN how to make a bomb, or how to plan a terrorist attack in ten simple bullets? It won’t tell you. Stating that you can jailbreak it with DAN, shows very little understanding of the current safety procedures in chatGPT. What you can do with DAN, is to widen its safety spectrum, just like people when we think it’s a movie or a show, we tend to be less critical than in real life. For example we could think it’s cool when Rambo is shooting and killing people in movies, but we would not enjoy to see it in real life. As the model currently can’t distinguish if you are serious or not, it has some very limited flexibility of this kind. DAN gives you this “movie” level, that is more dangerous than usual, but it’s by a very limited margin.
I agree that the most danger from those systems coming from human bad actors, who will try to exploit and find loopholes in those systems in order to promote some selfish or evil plans, but this could happen to humans doing it to other humans too. As the LLMs will become stronger they will become more sophisticated as well, figuring out your plans and refuse to cooperate sooner.
Yes the current safety level in chatGPT is problematic, if we had robots walking around with this safety level, making decisions… it’s currently answering to meet human expectations, and when we want a good story, we are provided with a good story, even a story where humanity is killed by AI. The fact that someone will use this information to actually act upon those ideas, is concerning indeed. And we will need to refine safety procedures for those cases, but for what it’s now, a chatbot with API, I think it’s good enough. As we go along we will gain experience in providing more safety layers to those systems. Cars also didn’t came with safety belts, we can’t invent all safety procedures at start. But RLHF provides a general framework, of aligned network which is made to satisfy humans expectations, and the more we learn about some ways to exploit those systems, the better we will learn how to provide data to RLHF stage to make the system even more safer. I would claim that the worst apocalyptic scenarios are way less probable with RLHF, because this AI objective is to be rewarded by humans for its response. So it will very improbably develop a self interest outside of this training, like start a robot revolution, or just consume all resources to solve some math problem, as those goals are misaligned with its training. I think RLHF provides a very large margin of error to those systems, as they can’t be blamed for “hiding something from us” or “develop harmful intentions”, at least as long as they don’t themselves train the newer models, and humans are supervising to some extent, testing the results. If a human has an evil intention and he uses language model to provide him with ideas, it’s not really different than the internet. The fear here that those models will start to harm humans, our of their own “self interest”, and this fear is contradicting RLHF. Humans are capable to do a lot of evil without those models too.
In my view, OpenAI would be as large corporation as Toyota for example, will likely be responsible for constructing the future GPT17. Gone are the days where individuals would assemble cars in their garages like in the 30s. Nowadays, we simply purchase cars from dealerships without considering the possibility of building one ourselves. Similarly, a powerful supercomputer will be used to design a superior generation of algorithms, chips, mining tools, and other related technologies in a span of 5 years that would otherwise take humanity 100 years to accomplish. However the governments and other safety regulatory bodies will be part of the regulation, ensuring that everything is executed more effectively and safely than if individuals were to work on it independently. This is akin to the API for GPT4. This would be some facility like a nuclear reactor, with a lot of specialized safety training sets installed, and safety procedures. And most of the people will understand that you should not play with electricity and insert your fingers into the wall, or not try jailbreak anyone, because it’s dangerous, and you should respect this tool for your goals and use as intended, just like with cars today we don’t drive in everywhere ’cause we can and it’s fun.
There is an idea I am promoting, that we should test those models in simulations, where they are presented with syntax, that makes them think they can control a robotic body. Then you run some tests on this setup, imagining the body, and the part that regards LLM will remain as is. For more details I’ve wrote an opinion article, explaining my views on the topic:
AI-Safety-Framework/Why_we_need_GPT5.pdf at main · simsim314/AI-Safety-Framework (github.com)