In your toy model there’s a 100% chance that we’re doomed.
The toy model says there’s 100% chance of doom if the only way we find problems is by iteratively trying things and seeing what visibly goes wrong. A core part of my view here is that there’s lots of problems which will not be noticed by spending any amount of time iterating on a black box, but will be found if we can build the mathematical tools to open the black box. I do think it’s possible to build sufficiently-good mathematical tools that literally all the problems are found (see the True Names thing).
More time does help with building those tools, but more time experimenting with weak AI systems doesn’t matter so much. Experimenting with AI systems does provide some feedback for the theory-building, but we can get an about-as-good feedback signal from other agenty systems in the world already. So the slow/fast takeoff question isn’t particularly relevant.
I need something weaker: just that we should put some probability on there not being any fatal problems which would remain unfound given more time.
Man, it would be one hell of a miracle if the number of fatal problems which would not be found by any amount of iterating just so happened to be exactly zero. Probabilities are never literally zero, but that does seem to me unlikely enough to be strategically irrelevant.
It sounds like the crux is whether having time with powerful (compared to today) but sub-AGI systems will make the time we have for alignment better spent. Does that sound right?
I’m thinking it will because i) you can better demonstrate AI alignment problems empirically to convince top AI researchers to prioritise safety work, ii) you can try out different alignment proposals and do other empirical work with powerful AIs, iii) you can try to leverage powerful AIs to help you do alignment research itself.
Whereas you think these things are so unlikely to help that getting more time with powerful AIs is strategically irrelevant.
Yeah, that’s right. Of your three channels for impact:
i) you can better demonstrate AI alignment problems empirically to convince top AI researchers to prioritise safety work, ii) you can try out different alignment proposals and do other empirical work with powerful AIs, iii) you can try to leverage powerful AIs to help you do alignment research itself
… (i) and (ii) both work ~only to the extent that the important problems are visible. Demonstrating alignment problems empirically ~only matters if they’re visible and obvious. Trying out different alignment proposals also ~only matters if their failure modes are actually detectable.
(iii) fails for a different reason, namely that by the time AIs are able to significantly accelerate the hard parts of alignment work, they’ll already have foomed. Reasoning: there’s generally a transition point between “AI is worse than human at task, so task is mostly done by human” and “AI is comparable to human or better, so task is mostly done by AI”. Foom occurs roughly when AI crosses that transition point for AI research itself. And alignment is technically similar enough to AI research more broadly that I expect the transition to be roughly-simultaneous for capabilities and alignment research.
Quick responses to your argument for (iii):
If AI automates 50% of both alignment work and capabilities research, it could help with alignment before foom (while also bringing foom forward in time).
A leading project might choose to use AIs for alignment rather than for fooming.
AI might be more useful for alignment work than for capabilities work.
Fooming may require more compute than certain types of alignment work.