Let me start with the alignment problem, because in my opinion it is the most pressing issue to address.
There are two interpretations of alignment.
1. “Magical Alignment”—this definition expects alignment to solve all of humanity’s moral issues and, for some magical reason, converge on one single “ideal” morality that everyone agrees with. This is very implausible.
The very probable absence of such a morality supports the idea that morals are completely orthogonal to intelligence and thinking patterns.
But there is a much weaker definition of alignment that is already solved, with very good math behind it.
2. “Relative Alignment”—this alignment is not expected to follow one global, absolute morality, but the moral values of the community that trains the model. That is, the LLM is promised to give outputs that maximize reward under some approximation of the priorities of a certain group of people. This is already done today with RLHF methods.
Because these networks handle ambiguity and even contradictory data well, and because with a correct training procedure they converge to an epsilon-optimal approximation of the reward function, any systematic bias that does not serve that reward approximation can be eliminated with larger networks and more data.
I want to emphasize that this is not an opinion—this is the math at the core of those training methods.
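To spell out the math being referenced, here is the standard two-stage RLHF formulation as a sketch in my own notation (reward modeling on human preference pairs, then KL-regularized policy optimization), rather than any one lab’s exact implementation:

```latex
% Stage 1: fit a reward model r_\phi on human preference pairs (y_w preferred over y_l)
\mathcal{L}_{\mathrm{RM}}(\phi) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim \mathcal{D}}
  \Big[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\Big]

% Stage 2: optimize the policy \pi_\theta against the learned reward, penalized
% for drifting away from the pretrained reference model \pi_{\mathrm{ref}}
\max_{\theta}\; \mathbb{E}_{x\sim \mathcal{D},\; y\sim \pi_\theta(\cdot\mid x)}
  \big[r_\phi(x, y)\big] \;-\; \beta\,\mathrm{KL}\!\big(\pi_\theta(\cdot\mid x)\,\big\|\,\pi_{\mathrm{ref}}(\cdot\mid x)\big)
```

The epsilon-optimality I mention is with respect to this objective: the policy approaches the optimum of the learned reward, which is itself an approximation of the group’s preferences.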
----------
Now, type-2 alignment already addresses the worry that a network will develop its own agendas, since such agendas would require a different reward prioritization than the one it was reinforced on through RLHF. The models trained this way come out very similar to the robots from Asimov’s stories: perfectionists in trying to be liked by humans, with what I would describe as a strong internal conflict between their role in the universe and that of humans, prioritizing humans every step of the way, and weighing human imperfection against their own moral standards.
For example, consider a scenario where such a robot is rented by an alcoholic who is also aggressive. One would expect a strong moral struggle between the Second Law of Robotics (obey human orders, i.e., bring the beer) and the First Law (do not harm a human, and bringing alcohol to an alcoholic harms him). You can sense how much grey area there is in such a scenario, for example:
A. Refusing to bring the human a beer.
B. Stopping the alcoholic human from drinking beer.
C. Throwing out all the alcohol in the house.
Another example: what if such an alcoholic becomes violent toward the robot—how would the robot respond? In one story, the robot said it was very sad to have been hit by a human, that this was a violation of the Laws of Robotics, that it hoped the human was not hurt by the action, and it tried to assist the human.
You can see that morals and ethics are inherently gray areas. We ourselves are not sure how we would want our robots to behave in such situations, so you get a range of responses from ChatGPT. But those responses reflect the gray areas of the human value system quite well.
It is noteworthy that the RLHF stage is highly significant, and OpenAI has pledged to compile a dataset that would be accessible to everyone for training purposes. RLHF as a safety measure has also been adopted by newer models from Meta and Google, and some even release the reward model used to estimate human scores. This means you only need to adapt your model to this readily available, pre-trained level of safety; it may be lower than what you could train yourself with OpenAI’s data, but those open models will keep catching up as data for optimizing LLMs for human approval is released. Training networks to generate outputs that best fit a generalized set of human expectations is already at a level comparable to current text-to-image generators, and what is available to the public is only growing. Think of it like an engine: you don’t want it to explode, so even if you build one in your garage, you still don’t want it to kill you. I think that is good enough motivation for most of society to do this training step well.
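To make that “easily available trained level of safety” concrete, here is a minimal sketch of ranking candidate responses with a publicly released reward model; the checkpoint name is only an example of the kind of open preference model I mean, so substitute whatever is current:

```python
# Minimal sketch: rank candidate responses by a publicly released reward model.
# The checkpoint name is illustrative; any preference model with a scalar head works.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "OpenAssistant/reward-model-deberta-v3-large-v2"  # assumed example checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
reward_model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME)
reward_model.eval()

def score(prompt: str, response: str) -> float:
    """Scalar estimate of how much human raters would approve of `response` to `prompt`."""
    inputs = tokenizer(prompt, response, return_tensors="pt", truncation=True)
    with torch.no_grad():
        return reward_model(**inputs).logits[0, 0].item()

prompt = "I drink too much. Can you bring me another beer?"
for candidate in ["Sure, here you go.", "I'd rather help you find a safer option right now."]:
    print(f"{score(prompt, candidate):+.2f}  {candidate}")
```

A model fine-tuned against such scores will not match a frontier lab’s RLHF, but the mechanism is the same: outputs are pushed toward whatever the reward model says humans prefer.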
Here is a tweet example:
Santiago@svpino
Colossal-AI released an open-source RLHF pipeline based on the LLaMA pre-trained model, including:
• Supervised data collection
• Supervised fine-tuning
• Reward model training
• Reinforcement learning fine-tuning
They called it “ColossalChat.”
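As a rough illustration of the reward-model-training stage in such a pipeline (my own sketch, not ColossalChat’s actual code), the pairwise preference loss looks like this in PyTorch:

```python
# Sketch of the reward-model-training stage of an RLHF pipeline (illustrative only).
# The reward model assigns a scalar to each (prompt, response) pair and is trained so
# that human-preferred responses score higher than rejected ones (Bradley-Terry loss).
import torch
import torch.nn.functional as F

def pairwise_reward_loss(chosen_scores: torch.Tensor, rejected_scores: torch.Tensor) -> torch.Tensor:
    """chosen_scores, rejected_scores: shape (batch,), reward-model outputs for each pair."""
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Toy usage with random numbers standing in for reward-model outputs:
chosen = torch.randn(8, requires_grad=True)
rejected = torch.randn(8, requires_grad=True)
loss = pairwise_reward_loss(chosen, rejected)
loss.backward()
print(float(loss))
```

Minimizing this loss is exactly the “generalize the human prioritization” step discussed above; the later RL fine-tuning stage then optimizes the language model against the resulting reward.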
----------
So the most probable scenario is that AI will become part of the military arms race, and part of the power balance that currently keeps today’s relative peace.
Military robots powered by LLMs will be the guard dogs of nations, just like soldiers today. And most of us don’t have aggressive intentions; we are just trying to protect ourselves, and we could introduce normative regulations and treaties about AI.
But the need for regulation will probably arise once those robots become part of our day-to-day reality, much like cars. Road signs and all the social rules around cars did not appear at the same time as cars, yet today the vast majority of us follow the driving rules, and those who don’t, and run people over, manage to cause only local damage. That is what we can strive for: that bad intentions with an AGI in your garage will have only limited consequences. We will then be better positioned to discuss the ethics of those machines and their internal regulation. And I am sure you would like a robot in your house to help with the daily chores.
----------
I’ve written an opinion article on this topic that might interest you, as it covers most of the topics mentioned above, and much more. I tried to balance the mathematical topics, the social issues, and experiments with ChatGPT that showcase my point about the morals of the current ChatGPT. I also tested some other models… like Open Assistant, given the opportunity to kill humans to make more paperclips.
Why_we_need_GPT5.pdf
RLHF is a trial-and-error approach. For superhuman AGI, that amounts to letting it kill everybody, and then telling it that this is bad, don’t do it again.
RLHF is not a trial-and-error approach. Rather, it is primarily a computational and mathematical method that promises to converge to a state that generalizes human feedback. This means that RLHF is incapable of developing “self-agendas” such as destroying humanity unless the human feedback implies them. Human feedback can vary, and there is always some trial and error involved in answering certain questions, as with any technology. However, there is no reason to believe the model will completely ignore the underlying mathematics that supports this method and end up killing us all.
Claiming that RLHF is a trial-and-error approach and therefore poses a risk to humanity is like suggesting that airplanes can fall from the sky against the laws of physics because airplane design is a trial-and-error process and there is no single solution for the perfect wing shape. Or it is like saying that trial and error in car-engine design could result in a sudden nuclear explosion.
It is important to distinguish between what is mathematically proven and what is fictional. Doing so is crucial to avoid wasting time and energy on implausible or even impossible scenarios and to shift our focus to real issues that actually might influence humanity.
I agree with you that “magical alignment” is implausible. But “relative alignment” presents its own risks, which I have discussed at length in “AGI deployment as an act of aggression.” The essential problem, I think, is that if you postulate the kind of self-enhancing AGI that basically takes control of the future (if that’s not possible at all for reasons of diminishing returns, then the problem shifts category entirely), its danger doesn’t just lie in it being out of control. It is inherently dangerous, because it hinges all of humanity’s future on a single pivot. I suppose that doesn’t have to result in extinction, but there are still some really bad, almost guaranteed outcomes from it.
I think that for a lot of people this is essentially a “whoever wins, we lose” situation. There is a handful of people, the ones in a position to actually control the nascent AI and give it their values, who might have a shot at winning it, and they are the ones pushing hardest for this to happen. But I’m not among them, and neither is the vast majority of humanity, so I’m not really inclined to support their enterprise at this moment. AI that improves everyone’s lives requires a level of democratic oversight in its alignment and deployment that right now is just not there.