This analogy is valid in the case where we have absolutely no idea how to use a system’s representations or “knowledge” to direct an AI’s behavior. That is the world Yudkowsky wrote the sequences in. It is not the world we currently live in. There are several, perhaps many, plausible plans to direct a competent AGI’s actions and its “thoughts” and “values” toward either its own or a subsystem’s “understanding” of human values. See “Goals selected from learned knowledge: an alternative to RL alignment” for some of those plans. Critiques need to go beyond the old “we have no idea” argument and actually address the ideas we have.
That’s incorrect, but more importantly it’s off topic. The topic is “what does the complexity of value have to do with the difficulty of alignment”. Barnett AFAIK in this comment is not saying (though he might agree, and maybe he should be taken as saying so implicitly or something) “we have lots of ideas for getting an AI to care about some given values”. Rather he’s saying “if you have a simple pointer to our values, then the complexity of values no longer implies anything about the difficulty of alignment because values effectively aren’t complex anymore”.
This.
I’m not sure you could be as confident as Yudkowsky was at the time, but yeah, given the epistemic state of 2008, there was a serious probability that human values were so complicated, and that simple techniques would make AIs Goodhart so completely on the intended task, that controlling smart AI was essentially hopeless.
We now know that a lot of the old LessWrong lore on how complicated human values and wishes are, at least in the code section, is either incorrect or irrelevant, and we also know that the standard LW story of how humans came to dominate other animals is incorrect to a degree that impacts AI alignment.
I have my own comments on the ideas below, but people really should try to update on the evidence we gained from LLMs. We learned a lot about ourselves and LLMs in the process, a lot of that evidence generalizes from LLMs to future AGI/ASI, and IMO LW updated way, way too slowly on AI safety.
https://www.lesswrong.com/posts/83TbrDxvQwkLuiuxk/?commentId=BxNLNXhpGhxzm7heg
https://www.lesswrong.com/posts/YyosBAutg4bzScaLu/thoughts-on-ai-is-easy-to-control-by-pope-and-belrose#4yXqCNKmfaHwDSrAZ (This is more of a model-based RL approach to alignment)
https://www.lesswrong.com/posts/wkFQ8kDsZL5Ytf73n/my-disagreements-with-agi-ruin-a-list-of-lethalities#dyfwgry3gKRBqQzoW
https://www.lesswrong.com/posts/wkFQ8kDsZL5Ytf73n/my-disagreements-with-agi-ruin-a-list-of-lethalities#7bvmdfhzfdThZ6qck