I don’t understand Control as aiming to align a superintelligence:
Control isn’t expected to scale to ASI (as you noted; also see “The control approach we’re imagining won’t work for arbitrarily powerful AIs” here)
We don’t have a plan for how to align an ASI using Control, afaik
Ryan said around March 2024: “On (1) (not having a concrete plan for what to do with smarter systems), I think we should get such a plan”. I’m still looking, this seems important.
Edit: Maybe I’m wrong and Redwood does think that getting an ASI-alignment plan out of such an AI is possible, and they’d just prefer (as a bonus?) to have the plan in advance?
I think the Control pitch is:
Whatever we plan to do anyway, we could do it more easily if we had a Controlled AI
This includes most[1] things you might otherwise do instead of working on Control, such as trying to coordinate a pause or implement some technical agenda.
The tradeoff seems to be “start working on the alternative [e.g. a pause] now and have more time” vs “have better tools (Control) to help but have less time to use them”
An uncontrolled AI might be more interested in advancing capabilities than alignment (see footnote[2])
(Mistakes are mine, especially if I misinterpreted quotes)
[1] It doesn’t include all the things you might do, as you said. (I have an open question here, TBD)
[2] Ryan said: “this doesn’t seem like an objection to labs employing control as their plan, but it is a reason why control work could be net negative. Note that one goal with control will be to prevent dangerous side effects from misaligned AIs (e.g. escaping) and applying this doesn’t obviously help with capabilities research. I agree that research which just makes AIs more useful is less good than you would otherwise expect due to this issue (but making AIs more useful in adversarial cases seems good if that doesn’t transfer to non-adversarial cases). Note that AIs might want to make capabilities research outpace safety and thus if we don’t do the adversarial case, we might get particularly wrecked”. This was a really surprising thought to me. At the same time, I’m a bit afraid this relies on a pretty specific guess about a system we don’t understand.