I was excited to read this, because Nate is a clear writer and a clear thinker, who has a high p(doom) for reasons I don’t entirely understand. This did pay off for me in a brief statement that clarified some of his reasons I hadn’t understood:
Nate said
this is part of what i mean by “i don’t think alignment is all that hard”
my high expectation of doom comes from a sense that there’s lots of hurdles and that humanity will flub at least one (and probably lots)
I find this disturbingly compelling. I hadn’t known Nate thought alignment might be fairly easy. Given that, his pessimism is more relevant to me, since I’m pretty sure alignment is do-able even in the near future.
I’m afraid I found the rest of it convoluted, and it seemed to make little progress toward a contentful discussion.
Let me try to summarize the post in case it’s helpful. None of these are direct quotes:
Nate: I think alignment by default is highly unlikely
Ronny: I think alignment by default is highly unlikely
(this somehow took most of the conversation)
Ronny: But we won’t do alignment by default. We’ll do it with RL. Sometimes, when I talk to Quintin, I think we might get working alignment by doing RL and pointing the system at lots of stuff we want it to do. It might reproduce human values accurately enough to do that.
Nate: There are a lot of ways to get anything done. So telling it what you want it to do is probably not going to make it generalize well or actually value the things you value.
Ronny: I agree, but I don’t have a strong argument for it.
…
So in sum, I didn’t see any strong argument beyond “there are lots of ways to get things done, so a value match is unlikely”.
Like Rob and Nate, I have the intuition that this approach is unlikely to work.
The number of ways to get things done is substantially constrained if the system is somehow trained to use human concepts and thinking patterns. So maybe that’s the source of optimism for Quintin and the Shard Theorists? Training on language does seem to substantially constrain a model to use human-like concepts.
I think the bulk of the disagreement is deeper and vaguer. One point of vague disagreement seems to be something like: Theory suggests that alignment is hard. Empirical data (mostly from LLMs) suggests that it’s easy to make AI do what you want. Which do you believe?
Fortunately, I don’t think RL alignment is our only or best option, so I’m not hugely invested in the disagreement as it stands, because both perspectives are primarily thinking about RL alignment. I think We have promising alignment plans with low taxes.
I think they’re promising because they’re completely different from RL approaches. More on that in an upcoming post.