I agree with bullet points 1, 2, 3, 6 and 7, partially agree with bullet point 5, and disagree with bullet point 4.
Thus, I agree with the central claim here:
I think Redwood Research’s recent work on AI control really “hits it out of the park”, and they have identified a tractable and neglected intervention that can make AI go a lot better. Obviously we should shift labor until the marginal unit of research in either area decreases P(doom) by the same amount. I think that implies lots of alignment researchers should shift to AI control type work, and would naively guess that the equilibrium is close to 50⁄50 across people who are reading this post. That means if you’re working on alignment and reading this, I think there’s probably a ~45% chance it would be better for your values if you instead were working on AI control!
For more of my analysis on the bullet points, read the rest of the comment.
I basically agree with bullet point 1, mostly because I don’t favor binary assumptions and instead prefer continuous-quantity reasoning, which tends to be a better match for the real world and also gives you more information than binary outcomes.
I really like bullet point 2, and also think that even in a scenario where it’s easy to prevent defection, you should still have controls that ensure defecting employees gain much less from subversive actions and face much harsher punishment for them.
I deeply agree with point 3, and I’d frame AI control in one of two ways:
As a replacement for the pivotal act concept.
As a pivotal act that doesn’t require destruction or death, and doesn’t require you to overthrow nations in your quest.
A nitpick: AI labor will make up the large majority of alignment progress at every stage, not just the early stage.
I think one big reason the pivotal act frame dominated so many discussions is the assumption that we would get a pure software singularity that FOOMs within several weeks. Reality is shaping up not to be a pure software singularity, since physical factors like robotics and data centers still matter.
There’s a reason every hyperscaler is trying to secure large amounts of power and data-center compute contracts: they realize the singularity is currently bottlenecked on power and, to a lesser extent, compute.
I disagree with 4, but that’s because of my views on alignment: I see it as a significantly easier problem than the median LWer does, and in particular I see essentially no need for philosophical deconfusion to make the future go well.
I agree that AI control enhances alignment arguments universally, and provides more compelling stories. I disagree with the assumption that all alignment plans depend on dubious philosophical assumptions about the nature of cognition/SGD.
I definitely agree with bullet point 6 that superhuman savant AI could well play a big part in the intelligence explosion, and I believe this most strongly for formal math theorem provers and AI coders.
I agree with bullet point 7, and think it would definitely be helpful if people focused more on “how can this be helpful even if it fails to produce an aligned AI?”
I feel like our viewpoints have converged a lot over the past couple of years, Noosphere. Which I suppose makes sense, since we’ve both been updating on similar evidence!
The one point I’d disagree on, though I’d also note that the disagreement seems irrelevant to short-term strategy, is that I do think philosophy and figuring out values are going to be pretty key to getting from a place of “shaky temporary safety” to a place of “long-term stable safety”.
But I think our views on the sensible next steps to get to that initial at-least-reasonable-safety sound quite similar.
Since I’m pretty sure we’re currently in a quite fragile place as a species, I think it’s worth putting off thinking about long-term safety (decades) to focus on short- and medium-term safety (months/years).