The post seems to equate LLMs understanding ethics with caring about ethics, which does not clearly follow (I can study Buddhist ethics without caring about following it). We could cast RLHF as training LLMs into caring about some sort of ethics, but then jailbreaking becomes a bit of a thorny question. Alternatively, why do we assume that training the appearance of obedience is enough once you start scaling LLMs?
It’s correct that understanding a value != caring about the value in the general case, and this definitely should be fixed, but I think the defensible claim here is that the data absolutely influence which values a model eventually adopts, and we do have ways to influence what an LLM values just by changing its datasets.
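To make that last claim concrete, here is a minimal sketch (my own illustration, not anything from the post): one crude way to "change the dataset" is to filter or reweight a fine-tuning corpus with a value scorer before training. The file paths and the `value_score` heuristic are hypothetical placeholders, standing in for whatever classifier or LLM judge a lab would actually use.

```python
import json

def value_score(text: str) -> float:
    """Hypothetical scorer: higher means more in line with the target value.
    In practice this would be a trained classifier or an LLM judge;
    here it is just a keyword heuristic to keep the sketch self-contained."""
    keywords = ("honest", "refuse", "careful")
    return float(sum(kw in text.lower() for kw in keywords))

def curate(in_path: str, out_path: str, threshold: float = 1.0) -> None:
    """Keep only the fine-tuning examples whose completions clear the threshold."""
    with open(in_path) as f_in, open(out_path, "w") as f_out:
        for line in f_in:
            example = json.loads(line)  # expects {"prompt": ..., "completion": ...}
            if value_score(example["completion"]) >= threshold:
                f_out.write(json.dumps(example) + "\n")

# curate("sft_corpus.jsonl", "curated_corpus.jsonl")
# The curated file then feeds an otherwise ordinary fine-tuning run.
```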
There are other nitpicks I will drop in short form: why assume “superhuman levels of loyalty” in upgraded LLMs? Why implicitly assume that LLMs will extend ethics correctly? Why do you think mechanistic interpretability is so much more promising than old school AI safetyists do? Why does self-supervision result in rising property values in Tokyo?
As far as why we should assume superhuman levels of loyalty, the basic answer is that the second-species argument relies on premises that are crucially false in the AI case.
The big reason gorillas/chimpanzees lost out and got brutally killed when humans came to dominate is that we were made by a ridiculously sparse RL process. That means there was barely any alignment effort from evolution or from any species genetically close to humans, and more importantly there was no gorilla/chimpanzee alignment effort at all, nor did they have any tools to control what our data sources were. The AI case is different on all of those counts: we give much denser feedback, we have far more control over the AI’s data sources, and we get help from SGD on any inner alignment issue, and SGD is a far more powerful optimizer than evolution/natural selection, mostly because it doesn’t have very exploitable hacks.
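To illustrate the "denser feedback" point with a toy sketch of my own (placeholder names throughout, not anything from the post or thread): evolution handed humans roughly one noisy fitness signal per lifetime, whereas RLHF-style training can grade every individual output with a reward model.

```python
import random

def reward_model(response: float) -> float:
    """Stand-in for a learned reward model that grades a single response."""
    return response

def sparse_signal(policy, steps: int = 1000) -> float:
    """Evolution-style feedback: one bit of 'fitness' for a whole lifetime of actions."""
    lifetime = [policy() for _ in range(steps)]
    return float(sum(lifetime) > steps / 2)

def dense_signal(policy, steps: int = 1000) -> list[float]:
    """RLHF-style feedback: a graded score on every individual output."""
    return [reward_model(policy()) for _ in range(steps)]

policy = random.random  # a placeholder "policy" emitting numbers in [0, 1)

print(sparse_signal(policy))      # 1 scalar of feedback for ~1000 actions
print(len(dense_signal(policy)))  # ~1000 separate feedback signals
```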
I’d agree that the arguments I raise could be addressed (as endless arguments attest) and OP could reasonably end up with a thesis like “LLMs are actually human aligned by default”. Putting my recommendation differently, the lack of even a gesture towards those arguments almost caused me to dismiss the post as unserious and not worth finishing.
I’m somewhat surprised, given OP’s long LW tenure. Maybe this was written for a very different audience and just incidentally posted to LW? Except the linkpost tagline focuses on the 1st part of the post, not the 2nd, implying OP thought this was actually persuasive?! Is OP failing an intellectual Turing test or am I???
I agree with you that it’s quite bad that Roko didn’t attempt to do this, and my steelmanning doesn’t change the fact that the original argument is weak and should be shored up.