I’m not sure you could be as confident as Yudkowsky was at the time, but yeah, in the epistemic state of 2008 there was a serious probability that human values were so complicated, and that simple techniques would make AIs Goodhart so completely on the intended task, that controlling smart AI was essentially hopeless.
We now know that a lot of the old LessWrong lore about how complicated human values and wishes are, at least in the coding domain, is either incorrect or irrelevant, and we also know that the standard LW story of how humans came to dominate other animals is incorrect to a degree that matters for AI alignment.
I have my own comments on the ideas linked below, but people really should try to update on the evidence we gained from LLMs. We learned a lot about both ourselves and LLMs in the process, much of that evidence generalizes from LLMs to future AGI/ASI, and IMO LW updated way, way too slowly on AI safety.
https://www.lesswrong.com/posts/83TbrDxvQwkLuiuxk/?commentId=BxNLNXhpGhxzm7heg
https://www.lesswrong.com/posts/YyosBAutg4bzScaLu/thoughts-on-ai-is-easy-to-control-by-pope-and-belrose#4yXqCNKmfaHwDSrAZ (This is more of a model-based RL approach to alignment)
https://www.lesswrong.com/posts/wkFQ8kDsZL5Ytf73n/my-disagreements-with-agi-ruin-a-list-of-lethalities#dyfwgry3gKRBqQzoW
https://www.lesswrong.com/posts/wkFQ8kDsZL5Ytf73n/my-disagreements-with-agi-ruin-a-list-of-lethalities#7bvmdfhzfdThZ6qck