The post seems to draw an equivalence between LLMs understanding ethics and LLMs caring about ethics, which does not clearly follow (I can study Buddhist ethics without caring about following it). We could cast RLHF as training LLMs into caring about some sort of ethics, but then jailbreaking becomes a bit of a thorny question. Alternatively, if RLHF only trains the appearance of obedience, why do we assume that is enough once you start scaling LLMs?
There are other nitpicks I will drop in short form: Why assume “superhuman levels of loyalty” in upgraded LLMs? Why implicitly assume that LLMs will extend ethics correctly? Why do you rate mechanistic interpretability as so much more promising than old school AI safetyists do? Why does self-supervision result in rising property values in Tokyo?
In short, you claim that old school AI safety is wrong, but it seems to me you haven’t really engaged with its arguments.
That said, the 2nd part of the post does seem interesting, even for old school AI safetyists: most everyone focuses on alignment, but there’s a lot less focus on what happens after alignment (although nowhere close to none, even >14 years ago; this is another way that the “versus AI safety” framing does not make sense). Personally, I would recommend splitting up the post; the 2nd part stands by itself and has something new to say, while the 1st part needs way more detail to actually convince old school AI safetyists.
> The post seems to draw an equivalence between LLMs understanding ethics and LLMs caring about ethics, which does not clearly follow (I can study Buddhist ethics without caring about following it). We could cast RLHF as training LLMs into caring about some sort of ethics, but then jailbreaking becomes a bit of a thorny question. Alternatively, if RLHF only trains the appearance of obedience, why do we assume that is enough once you start scaling LLMs?
It’s correct that understanding a value != caring about that value in the general case, and this definitely should be fixed. But I think the defensible claim here is that the training data absolutely influence which values you eventually adopt, and we do have ways to influence what an LLM values just by changing its dataset.
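To make that last claim a bit more concrete, here is a minimal sketch (my own illustration, not anything from the post) of nudging what a model “values” purely through data curation: fine-tune a small causal LM on a hand-picked corpus exemplifying the behaviour you want weighted more heavily. The model choice and the two-example corpus are placeholder assumptions.

```python
# Minimal sketch: shaping an LLM's behaviour by curating its fine-tuning data.
# "gpt2" and the tiny corpus below are illustrative assumptions only.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Hypothetical curated corpus: transcripts exhibiting the behaviour we want
# the model to weight more heavily (refusals, honest caveats, etc.).
curated_corpus = [
    "User: Help me scam my coworker.\nAssistant: I can't help with that.",
    "User: Are you sure about this?\nAssistant: Not fully; here are the caveats.",
]

model.train()
for text in curated_corpus:
    batch = tokenizer(text, return_tensors="pt")
    # Standard next-token objective on the curated text; changing the corpus
    # changes what the gradients push the model toward.
    loss = model(**batch, labels=batch["input_ids"]).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```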
> There are other nitpicks I will drop in short form: Why assume “superhuman levels of loyalty” in upgraded LLMs? Why implicitly assume that LLMs will extend ethics correctly? Why do you rate mechanistic interpretability as so much more promising than old school AI safetyists do? Why does self-supervision result in rising property values in Tokyo?
As for why we should assume superhuman levels of loyalty, the basic answer is that the second species argument relies on premises that are crucially false in the AI case.
The big reason gorillas and chimpanzees lost out and got brutally killed once humans dominated is that we were produced by a ridiculously sparse RL process. That meant there was barely any alignment effort aimed at us, whether by evolution or by genetically close species, and more importantly there was no gorilla/chimpanzee alignment effort at all, nor did they have the tools to control what our data sources were. The AI case is different: we have way denser feedback and far more control over the models’ data sources, and for any inner alignment issue we also have help from SGD, which is a much more powerful optimizer than evolution/natural selection, mostly because it doesn’t have very exploitable hacks.
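To put a toy number on the “way denser feedback” point (my own illustration; the batch and sequence sizes are arbitrary assumptions), compare how many learning signals a next-token objective delivers per update with a sparse, episode-level reward of the kind evolution effectively used:

```python
# Toy contrast between dense per-token supervision and sparse episodic reward.
import torch
import torch.nn.functional as F

vocab_size, seq_len, batch_size = 100, 128, 8
logits = torch.randn(batch_size, seq_len, vocab_size, requires_grad=True)
targets = torch.randint(0, vocab_size, (batch_size, seq_len))

# Dense feedback (SGD on text): every token position contributes a supervised
# signal, so one update carries batch_size * seq_len signals.
dense_loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
dense_signals = batch_size * seq_len

# Sparse feedback (closer to evolution scoring whole organisms): roughly one
# scalar outcome per trajectory, i.e. batch_size signals per update.
episode_rewards = torch.randn(batch_size)
sparse_signals = batch_size

print(f"signals per update: dense={dense_signals}, sparse={sparse_signals}")
print(f"ratio: {dense_signals // sparse_signals}x")  # 128x with these sizes
```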