In this post Matthew Barnett notices that we updated our beliefs between ~2007 and ~2023. I say “we” rather than MIRI or “Yudkowsky, Soares, and Bensinger” because I think this was a general update, but also to defuse the defensive reactions I observe in the comments.
What did we change our minds about? Well, in 2007 we thought that safely extracting approximate human values into a convenient format would be impossible. We knew that a superintelligence could do this, but a superintelligence would kill us, so that isn’t helpful. We knew that human values are more complex than fake utility functions or magical categories, so we can’t hard-code human values into a utility function. Instead, we looked for alternatives like corrigibility.
By 2023, we had learned that a correctly trained LLM can extract approximate human values without causing human extinction (yet). Barnett’s post points to GPT-4 as conclusive evidence, which is fair, though GPT-3 was an important update and many people updated then. I imagine that MIRI and other experts figured it out earlier still. This update has consequences for plans to avoid extinction or die with more dignity.
Unfortunately, much of the initial commentary was defensive, attacking Barnett for claims he did not make. Yudkowsky placed a disclaimer on The Hidden Complexity of Wishes implausibly denying that it is an AI parable. This is surprising: Yudkowsky’s Coming of Age and How to Actually Change Your Mind sequences are excellent. What went wrong?
An underappreciated sub-skill of rationality is noticing that I have, in the past, changed my mind. For me, this is pretty easy when I think back to my teenage years. But I’m in my 40s now, and I find it harder to think of major updates during my 20s and 30s, despite the world (and me) changing a lot in that time. Seeing this pattern of defensiveness in other people made me realize that it’s probably common, and that I probably have it too. I wish I had a guide to middle-aged rationality. In middle age my experience is supposed to be my value-add, but conveniently forgetting my previous beliefs throws some of that away.
Epistemic status: minimal. Mostly feelings.
Sometimes I read people, including Yudkowsky himself, saying that Yudkowsky-2008 didn’t say this, that the post wasn’t about that, and so forth. Not with evidence, not with a better reading, just a denial. Perhaps people are overestimating how accurately their brains have maintained a model of what Yudkowsky wrote 10+ years ago. If Alice read the sequences in 2008 and Bob read them in 2024, then Bob has the better model of what the sequences said. Evidence and arguments screen off authority.
But more importantly to me (here come the feelings), these defensive anti-interpretations of the sequences are boring and narrow and ugh. By positing that multiple apparently literate people misread the sequences both at the time (read the old comments) and today (read the new comments), they paint a picture of young Yudkowsky as a bad writer who attracted bad readers.
As I read it, The Hidden Complexity of Wishes is a glorious parable about AI and genies that paints graphic images of failure cases and invites both thought and imagination from the reader. As Yudkowsky-2024 tells me to read it, it is just making a point about the algorithmic complexity of human values. Yeah, I deny that; Death of the Author and all.
Likewise, as I read it, Magical Categories is about, well, categories. Categories that matter for capabilities and for alignment and for humans. It’s part of a network of rationalist thought that has ripples today in discussions about gender, adversarial examples, natural abstractions, and more. As others read it, Magical Categories is always in every instance talking about getting a shape into the AI’s preferences, never some other thing.
No thanks. Where recursive justification hits bottom is this: I read LessWrong with my brain; it’s the only one I have.