I broadly like the actual plan itself (obviously I would have some differences, but it is overall reasonably close to what I would imagine). However, it feels like there is an unwarranted amount of doom mentality here. To give one example:
What we need to achieve [...] The first AI that significantly speeds up alignment research isn’t successfully scheming [...]
The plan is divided into two layers, where the first layer seems absolutely required to me, i.e. any plan that doesn’t include these would very likely yield catastrophically bad results. [...]
Layer 1 [...] Everything in this section seems very important to me [...]
1. We should try hard to keep a paradigm with faithful and human-legible CoT
[...]
4. In both worlds, we should use the other, i.e. control/monitoring, as a second line of defense.
Suppose that for the first AI that speeds up alignment research, you kept a paradigm with faithful and human-legible CoT, and you monitored that reasoning for bad reasoning / actions, but you didn't do other kinds of control as a second line of defense. Taken literally, your words imply that this would very likely yield catastrophically bad results. I find it hard to see a consistent view that endorses this position without also believing that your full plan would very likely yield catastrophically bad results.
(My view is that faithful + human-legible CoT, along with monitoring, for the first AI that speeds up alignment research would very likely ensure that the AI system isn't successfully scheming, achieving the goal you set out. Whether there are later catastrophically bad results is still uncertain and depends on what happens afterwards.)
This is the clearest example, but I feel this way about a lot of the rhetoric in this post. E.g. I don't think it's crazy to imagine that without SL4 you still get good outcomes, even if just by luck, and I don't think a minimal stable solution involves most of the world's compute going towards alignment research.
To be clear, it’s quite plausible that we want to do the actions you suggest, because even if they aren’t literally necessary, they can still reduce risk and that is valuable. I’m just objecting to the claim that if we didn’t have any one of them then we very likely get catastrophically bad results.
That's fair. I think the more accurate way of phrasing this is not "we will get catastrophe" but rather "it clearly exceeds the risk threshold I'm willing to take / that I think humanity should clearly not take", which is a bar significantly lower than a 100% chance of catastrophe.