I read “capable of X” as meaning something like “if the model were actively trying to do X then it would do X”. I.e. a misaligned model doesn’t reveal the vulnerability to humans during testing because it doesn’t want them to patch it, but then later it exploits that same vulnerability during deployment because it’s trying to hack the computer system.
Tom Davidson
Let’s use AI to harden human defenses against AI manipulation
What tags are they?
Which AI outputs should humans check for shenanigans, to avoid AI takeover? A simple model
I agree that the final tasks that humans do may look like “check that you understand and trust the work the AIs have done”, and that a lack of trust is a plausible bottleneck to full automation of AI research.
I don’t think the only way for humans at AI labs to get that trust is to automate alignment research, though that is one way. Human-conducted alignment research might lead them to trust AIs, or they might have a large amount of trust in the AIs’ work without believing they are aligned. E.g. they separate the workflow into lots of narrow tasks that can be done by a variety of non-agentic AIs that they don’t think pose a risk; or they set up a system of checks and balances (where different AIs check each other’s work and look for signs of deception) that they trust despite thinking certain AIs may be unaligned; or they do such extensive adversarial training that they’re confident that the AIs would never actually try to do anything deceptive in practice (perhaps because the AIs are paranoid that a seeming opportunity to trick humans is just a human-designed test of their alignment). TBC, I think “being confident that the AIs are aligned” is better and more likely than these alternative routes to trusting the work.
Also, when I’m forecasting AI capabilities I’m forecasting AI that could readily automate 100% of AI R&D, not AI that actually does automate it. If trust were the only factor preventing full automation, that could count as AI that could readily automate 100%.
What a compute-centric framework says about AI takeoff speeds
But realistically not all projects will hoard all their ideas. Suppose instead that for the leading project, 10% of their new ideas are discovered in-house, and 90% come from publicly available discoveries accessible to all. Then, to continue the car analogy, it’s as if 90% of the lead car’s acceleration comes from a strong wind that blows on both cars equally. The lead of the first car/project will lengthen slightly when measured by distance/ideas, but shrink dramatically when measured by clock time.
The upshot is that we should return to that table of factors and add a big one to the left-hand column: Leads shorten automatically as general progress speeds up, so if the lead project produces only a small fraction of the general progress, maintaining a 3-year lead throughout a soft takeoff is (all else equal) almost as hard as growing a 3-year lead into a 30-year lead during the 20th century. In order to overcome this, the factors on the right would need to be very strong indeed.
But won’t “ability to get a DSA” be linked to the lead as measured in ideas rather than clock time?
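To make the two measures of lead concrete, here is a minimal sketch of the car analogy. Only the 90% public share comes from the quoted passage; the exponential speed-up rate, the follower's in-house rate, and the 3-year head start at one idea per year are all hypothetical numbers chosen for illustration, as are the function names.

```python
import math

GROWTH = 0.1             # assumed: the leader's idea rate grows as exp(GROWTH * t)
PUBLIC_SHARE = 0.9       # from the quoted text: 90% of the leader's new ideas are public
FOLLOWER_INHOUSE = 0.09  # assumed: follower's in-house rate, as a share of the leader's total rate
HEAD_START = 3.0         # assumed: leader starts 3 years ahead, at 1 idea/year before t = 0


def cumulative(t):
    """Ideas produced between time 0 and t at rate exp(GROWTH * t)."""
    return (math.exp(GROWTH * t) - 1.0) / GROWTH


def leader_ideas(t):
    """Leader's idea stock; before t = 0 it grew at a flat 1 idea/year."""
    if t < 0:
        return HEAD_START + t
    return HEAD_START + cumulative(t)


def follower_ideas(t):
    """Follower's idea stock for t >= 0: public ideas plus a weaker in-house engine."""
    return (PUBLIC_SHARE + FOLLOWER_INHOUSE) * cumulative(t)


def ideas_lead(t):
    """Lead measured in ideas (distance between the cars)."""
    return leader_ideas(t) - follower_ideas(t)


def clock_lead(t):
    """Lead measured in years: how long ago the leader had the follower's current stock."""
    target = follower_ideas(t)
    if target < HEAD_START:
        s = target - HEAD_START  # leader reached this level before t = 0
    else:
        s = math.log(1.0 + GROWTH * (target - HEAD_START)) / GROWTH
    return t - s


if __name__ == "__main__":
    for year in (0, 10, 20, 30):
        print(year, round(ideas_lead(year), 2), round(clock_lead(year), 2))
```

Under these assumptions the lead in ideas grows a little while the lead in clock time shrinks a lot, which is the distinction the question above turns on.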
Quick responses to your argument for (iii).
If AI automates 50% of both alignment work and capabilities research, it could help with alignment before foom (while also bringing foom forward in time)
A leading project might choose to use AIs for alignment rather than for fooming
AI might be more useful for alignment work than for capabilities work
Fooming may require more compute than certain types of alignment work
It sounds like the crux is whether having time with powerful (compared to today) but sub-AGI systems will make the time we have for alignment better spent. Does that sound right?
I’m thinking it will because i) you can better demonstrate AI alignment problems empirically to convince top AI researchers to prioritise safety work, ii) you can try out different alignment proposals and do other empirical work with powerful AIs, iii) you can try to leverage powerful AIs to help you do alignment research itself.
Whereas you think these things are so unlikely to help that getting more time with powerful AIs is strategically irrelevant.
In order to argue that alignment is importantly easier in slow takeoff worlds, you need to argue that there do not exist fatal problems which will not be found given more time.
I need something weaker; just that we should put some probability on there not being fatal problems which will not be found given more time. (I.e., some probability that the extra time helps us find the last remaining fatal problems.)
And that seems reasonable. In your toy model there’s 100% chance that we’re doomed. Sure, in that case extra time doesn’t help. But in models where our actions can prevent doom, extra time typically will help. And I think we should be uncertain enough about difficulty of the problem that we should put some probability on worlds where our actions can prevent doom. So we’ll end up concluding that more time does help.
Corollary: alignment is not importantly easier in slow-takeoff worlds, at least not due to the ability to iterate. The hard parts of the alignment problem are the parts where it’s nonobvious that something is wrong. That’s true regardless of how fast takeoff speeds are.
This is the important part and it seems wrong.
Firstly, there’s going to be a community of people trying to find and fix the hard problems, and if they have longer to do that then they will be more likely to succeed.
Secondly, ‘nonobvious’ isn’t an all-or-nothing term. There can easily be problems which are nonobvious enough that you don’t notice them with weeks of adversarial training but you do notice them with months or years.
Linking to a post I wrote on a related topic, where I sketch a process (see diagram) for using this kind of red-teaming to iteratively improve your oversight process. (I’m more focussed on a scenario where you’re trying to offload as much of the work in evaluating and improving your oversight process to AIs.)