Corollary: alignment is not importantly easier in slow-takeoff worlds, at least not due to the ability to iterate. The hard parts of the alignment problem are the parts where it’s nonobvious that something is wrong. That’s true regardless of how fast takeoff speeds are.
This is the important part and it seems wrong.
Firstly, there’s going to be a community of people trying to find and fix the hard problems, and if they have longer to do that then they will be more likely to succeed.
Secondly, ‘nonobvious’ isn’t an all-or-nothing term. There can easily be problems which are nonobvious enough that you don’t notice them with weeks of adversarial training, but which you do notice with months or years.
Toy model: we have some system with a bunch of problems. A group of people with some fixed skills/background will be able to find 80% of the problems given enough time; the remaining 20% are problems which they won’t find at all, because it won’t occur to them to ask the right questions. (The air conditioner pulling in hot air in the far corners of the house is meant to be an example of such a problem, relative to the skills/background of a median customer.) For the 80% of problems which the group can find, the amount of time required to find them has a wide tail: half the problems can be found in a week, another 25% in another two weeks, another 12.5% in another four weeks, etc.
(The numbers in this setup aren’t meant to be realistic; the basic idea I want to illustrate should occur for a fairly wide range of distributions.)
In this toy model:
the group is more likely to find any given problem if given more time.
‘nonobvious’ is not all-or-nothing; there are problems which won’t be found in a week but will be found in a year.
So this toy model matches both of your conditions.
What happens in this toy model? Well, after a bit over two years, 79.5% of the problems have been found. Almost all of the remaining 20.5% are problems which the group will not find, given any amount of time, because they do not have the skills/background to ask the right questions. They will still keep improving things over time, but it’s not going to make a large quantitative difference.
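To make the arithmetic concrete, here is a minimal sketch (not from the original discussion) of the toy model, assuming the illustrative parameters above: 80% of problems are findable at all, and each round of iteration takes twice as long as the previous one while finding half of the findable problems that remain.

```python
# Minimal sketch of the toy model's arithmetic (illustrative parameters only):
# 80% of problems are findable in principle; each successive round of iteration
# takes twice as long as the last and finds half of the findable problems left.

findable = 0.80  # fraction of problems the group can find at all


def fraction_found(weeks):
    """Fraction of all problems found after `weeks` of iterating.

    Round n lasts 2**(n-1) weeks, so n completed rounds take 2**n - 1 weeks
    and have found findable * (1 - 2**-n) of all problems.
    """
    elapsed = 0
    round_length = 1  # first round takes one week
    rounds_completed = 0
    while elapsed + round_length <= weeks:
        elapsed += round_length
        round_length *= 2
        rounds_completed += 1
    return findable * (1 - 0.5 ** rounds_completed)


for label, w in [("1 week", 1), ("~1 month", 4), ("~1 year", 52), ("~2.4 years", 127)]:
    print(f"{label:>10}: {fraction_found(w):.1%} of problems found")

# Prints roughly 40%, 60%, 77.5%, 79.4%: the fraction found creeps toward 80%
# but never reaches it, because the remaining ~20% are problems that iteration
# alone will never surface.
```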
Point is: you are arguing that there exist problems which will be found given more time. That is not the relevant claim. In order to argue that alignment is importantly easier in slow takeoff worlds, you need to argue that there do not exist fatal problems which will not be found given more time.
In order to argue that alignment is importantly easier in slow takeoff worlds, you need to argue that there do not exist fatal problems which will not be found given more time.
I need something weaker; just that we should put some probability on there not being fatal problems which will not be found given more time. (I.e., some probability that the extra time helps us find the last remaining fatal problems).
And that seems reasonable. In your toy model there’s 100% chance that we’re doomed. Sure, in that case extra time doesn’t help. But in models where our actions can prevent doom, extra time typically will help. And I think we should be uncertain enough about difficulty of the problem that we should put some probability on worlds where our actions can prevent doom. So we’ll end up concluding that more time does help.
In your toy model there’s 100% chance that we’re doomed.
The toy model says there’s 100% chance of doom if the only way we find problems is by iteratively trying things and seeing what visibly goes wrong. A core part of my view here is that there are lots of problems which will not be noticed by spending any amount of time iterating on a black box, but will be found if we can build the mathematical tools to open the black box. I do think it’s possible to build sufficiently-good mathematical tools that literally all the problems are found (see the True Names thing).
More time does help with building those tools, but more time experimenting with weak AI systems doesn’t matter so much. Experimenting with AI systems does provide some feedback for the theory-building, but we can get an about-as-good feedback signal from other agenty systems in the world already. So the slow/fast takeoff question isn’t particularly relevant.
I need something weaker; just that we should put some probability on there not being fatal problems which will not be found given more time.
Man, it would be one hell of a miracle if the number of fatal problems which would not be found by any amount of iterating just so happened to be exactly zero. Probabilities are never literally zero, but that does seem to me unlikely enough as to be strategically irrelevant.
It sounds like the crux is whether having time with powerful (compared to today) but sub-AGI systems will make the time we have for alignment better spent. Does that sound right?
I’m thinking it will because i) you can better demonstrate AI alignment problems empirically to convince top AI researchers to prioritise safety work, ii) you can try out different alignment proposals and do other empirical work with powerful AIs, iii) you can try to leverage powerful AIs to help you do alignment research itself.
Whereas you think these things are so unlikely to help that getting more time with powerful AIs is strategically irrelevant.
Yeah, that’s right. Of your three channels for impact:
i) you can better demonstrate AI alignment problems empirically to convince top AI researchers to prioritise safety work, ii) you can try out different alignment proposals and do other empirical work with powerful AIs, iii) you can try to leverage powerful AIs to help you do alignment research itself
… (i) and (ii) both work ~only to the extent that the important problems are visible. Demonstrating alignment problems empirically ~only matters if they’re visible and obvious. Trying out different alignment proposals also ~only matters if their failure modes are actually detectable.
(iii) fails for a different reason, namely that by the time AIs are able to significantly accelerate the hard parts of alignment work, they’ll already have foomed. Reasoning: there’s generally a transition point between “AI is worse than human at task, so task is mostly done by human” and “AI is comparable to human or better, so task is mostly done by AI”. Foom occurs roughly when AI crosses that transition point for AI research itself. And alignment is technically similar enough to AI research more broadly that I expect the transition to be roughly-simultaneous for capabilities and alignment research.
Quick responses to your argument for (iii):
If AI automates 50% of both alignment work and capabilities research, it could help with alignment before foom (while also bringing foom forward in time)
A leading project might choose to use AIs for alignment rather than for fooming
AI might be more useful for alignment work than for capabilities work
Fooming may require more compute than certain types of alignment work
For what it’s worth, I’ve had a similar discussion with John in another comment thread where he said that he doesn’t believe the probability of doom is 1, he just believes it’s some p≫0 that doesn’t depend too much on the time we have to work on problems past a time horizon of 1 week or so.
This is consistent with your model, so I don’t think John actually believes that the probability of doom is 1, and I don’t think he would necessarily disagree with your model either. On the other hand, in your model the probability of doom asymptotes to some p≫0 as extra time goes to infinity, so it’s also not true that extra time would be very helpful in this situation past a certain point.
TBC, I believe that the value of more time rapidly asymptotes specifically for the purpose of finding problems by trying things and seeing what goes wrong. More time is still valuable for progress via other channels.