In order to argue that alignment is importantly easier in slow takeoff worlds, you need to argue that there do not exist fatal problems which will not be found given more time.
I need something weaker; just that we should put some probability on there not being fatal problems which will not be found given more time. (I.e., some probability that the extra time helps us find the last remaining fatal problems).
And that seems reasonable. In your toy model there’s 100% chance that we’re doomed. Sure, in that case extra time doesn’t help. But in models where our actions can prevent doom, extra time typically will help. And I think we should be uncertain enough about the difficulty of the problem that we should put some probability on worlds where our actions can prevent doom. So we’ll end up concluding that more time does help.
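To make the structure of that argument explicit, here is a minimal sketch (illustrative only; the symbols $w$ and $\Delta$ are assumed, not from the discussion): let $w$ be the probability we assign to worlds in which our actions can prevent doom, and suppose extra time reduces doom probability by $\Delta > 0$ in those worlds while doing nothing in the rest. Then

$$\mathbb{E}[\text{reduction in doom probability from extra time}] = w\,\Delta + (1-w)\cdot 0 = w\,\Delta > 0 \quad \text{for any } w > 0.$$

So under model uncertainty, extra time helps in expectation as long as $w$ is not negligible.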
In your toy model there’s 100% chance that we’re doomed.
The toy model says there’s 100% chance of doom if the only way we find problems is by iteratively trying things and seeing what visibly goes wrong. A core part of my view here is that there are lots of problems which will not be noticed by spending any amount of time iterating on a black box, but will be found if we can build the mathematical tools to open the black box. I do think it’s possible to build sufficiently-good mathematical tools that literally all the problems are found (see the True Names thing).
More time does help with building those tools, but more time experimenting with weak AI systems doesn’t matter so much. Experimenting with AI systems does provide some feedback for the theory-building, but we can get an about-as-good feedback signal from other agenty systems in the world already. So the slow/fast takeoff question isn’t particularly relevant.
I need something weaker; just that we should put some probability on there not being fatal problems which will not be found given more time.
Man, it would be one hell of a miracle if the number of fatal problems which would not be found by any amount of iterating just so happened to be exactly zero. Probabilities are never literally zero, but that does seem to me unlikely enough as to be strategically irrelevant.
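As a rough illustration of why “exactly zero” would be a miracle (an assumed independence model, not from the comment): if there are $n$ candidate fatal problems and each is invisible to black-box iteration with independent probability $q$, then

$$\Pr[\text{no fatal problem is invisible to iteration}] = (1-q)^n,$$

which is already below $1\%$ for, e.g., $n = 50$ and $q = 0.1$.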
It sounds like the crux is whether having time with powerful (compared to today) but sub-AGI systems will make the time we have for alignment better spent. Does that sound right?
I’m thinking it will because i) you can better demonstrate AI alignment problems empirically to convince top AI researchers to prioritise safety work, ii) you can try out different alignment proposals and do other empirical work with powerful AIs, iii) you can try to leverage powerful AIs to help you do alignment research itself.
Whereas you think these things are so unlikely to help that getting more time with powerful AIs is strategically irrelevant.
Yeah, that’s right. Of your three channels for impact:
i) you can better demonstrate AI alignment problems empirically to convince top AI researchers to prioritise safety work, ii) you can try out different alignment proposals and do other empirical work with powerful AIs, iii) you can try to leverage powerful AIs to help you do alignment research itself
… (i) and (ii) both work ~only to the extent that the important problems are visible. Demonstrating alignment problems empirically ~only matters if they’re visible and obvious. Trying out different alignment proposals also ~only matters if their failure modes are actually detectable.
(iii) fails for a different reason, namely that by the time AIs are able to significantly accelerate the hard parts of alignment work, they’ll already have foomed. Reasoning: there’s generally a transition point between “AI is worse than human at task, so task is mostly done by human” and “AI is comparable to human or better, so task is mostly done by AI”. Foom occurs roughly when AI crosses that transition point for AI research itself. And alignment is technically similar enough to AI research more broadly that I expect the transition to be roughly-simultaneous for capabilities and alignment research.
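One hedged way to formalize this argument (the notation $c_{\mathrm{cap}}$, $c_{\mathrm{align}}$, $h$, $t^\ast$ is mine, not from the comment): let $c_{\mathrm{cap}}(t)$ and $c_{\mathrm{align}}(t)$ be AI capability at capabilities research and at the hard parts of alignment research, and $h$ the human level at each. Foom begins roughly at

$$t^\ast = \min\{\,t : c_{\mathrm{cap}}(t) \ge h\,\},$$

while AI significantly accelerates alignment roughly once $c_{\mathrm{align}}(t) \ge h$. If technical similarity means $c_{\mathrm{align}}(t) \approx c_{\mathrm{cap}}(t)$, those two thresholds are crossed at roughly the same time, so the acceleration does not arrive meaningfully before foom.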
Quick responses to your argument for (iii):
If AI automates 50% of both alignment work and capabilities research, it could help with alignment before foom (while also bringing foom forward in time)
A leading project might choose to use AIs for alignment rather than for fooming
AI might be more useful for alignment work than for capabilities work
Fooming may require more compute than certain types of alignment work
For what it’s worth, I’ve had a similar discussion with John in another comment thread where he said that he doesn’t believe the probability of doom is 1; he just believes it’s some p≫0 that doesn’t depend too much on the time we have to work on problems past a time horizon of 1 week or so.
This is consistent with your model, so I don’t think John actually believes that the probability of doom is 1, and I don’t think he would necessarily disagree with your model either. On the other hand, in your model the probability of doom asymptotes to some p≫0 as extra time goes to infinity, so it’s also not true that extra time would be very helpful in this situation past a certain point.
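A minimal functional form matching this description (illustrative; $p_0$, $p_\infty$, and $\lambda$ are assumed parameters, with $p_0$ the doom probability given no extra time and $p_\infty < p_0$):

$$\Pr[\text{doom} \mid \text{extra time } t] = p_\infty + (p_0 - p_\infty)\,e^{-\lambda t} \;\longrightarrow\; p_\infty \gg 0 \quad \text{as } t \to \infty,$$

so extra time helps early on, but its marginal value decays and the doom probability never drops below $p_\infty$.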
TBC, I believe that the value of more time rapidly asymptotes specifically for the purpose of finding problems by trying things and seeing what goes wrong. More time is still valuable for progress via other channels.