To solve the problem of aligning superhuman systems, you need some amount of complicated human thought/hard high-level work. If a system can output that much hard high-level work in a short amount of time, I consider that system to be superhuman, and the problem of aligning it to be “alignment-complete”: if you solve any of the problems in this class, you essentially solve alignment down the line and probably avoid x-risk, but solving any of these problems requires a lot of hard human work, and safely automating that much hard work is itself an alignment-complete problem.
There needs to be an argument for why one can successfully use a subhuman system to control a complicated superhuman system, as otherwise, having generations of controllable subhuman systems doesn’t matter.
Thinking carefully about these things (rather than rehashing MIRI-styled arguments a bit carelessly) is actually important, because it can change the strategic (alignment-relevant) landscape; e.g. from Before smart AI, there will be many mediocre or specialized AIs:
Assuming that much of this happens “behind the scenes”, a human interacting with this system might just perceive it as a single super-smart AI. Nevertheless, I think this means that AI will be more alignable at a fixed level of productivity. (Eventually, we’ll face the full alignment problem — but “more alignable at a fixed level of productivity” helps if we can use that productivity for something useful, such as giving us more time or helping us with alignment research.)
Most obviously, the token-by-token output of a single AI system should be quite easy for humans to supervise and monitor for danger. It will rarely contain any implicit cognitive leaps that a human couldn’t have generated themselves. (Cf. the visible thoughts project and the translucent thoughts hypothesis.)
A specialised AI can speed up Infra-Bayesianism by the same amount random mathematicians can, by proving theorems and solving some math problems. A specialised AI can’t actually understand the goals of the research and contribute to the parts that require the hardest kind of human thinking. Some amount of problem-solving of the kind the hardest human thinking produces has to go into the problem. I claim that if a system can output enough of that kind of thinking to meaningfully contribute, then it’s going to be smart enough to be dangerous.
I further claim that there’s a number of hours of complicated-human-thought such that making a safe system that can output work corresponding to that number in less than, e.g., 20 years, requires at least that number of hours of complicated human thought.
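One way to write this claim down more explicitly (a paraphrase; the quantities H_0, W(S), and C(S) below are notation introduced here purely for illustration, not something defined elsewhere):

```latex
% Paraphrase of the claim above; H_0, W(S), C(S) are notation introduced here.
% W(S): hours-equivalent of complicated human thought that system S can output
%       within, e.g., 20 years.
% C(S): hours of complicated human thought required to build S safely.
\exists H_0 > 0 \;\forall S:\quad
  \bigl( S \text{ is safe} \,\wedge\, W(S) \ge H_0 \bigr)
  \;\Longrightarrow\; C(S) \ge H_0
```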
Safely getting enough productivity out of these systems for it to matter is impossible IMO. If you think a system can solve specific problems, then please outline these problems (what are the hardest problems you expect to be able to safely solve with your system?) and say how fast the system is going to solve them and how many people will be supervising its “thoughts”. Even putting aside object-level problems with these approaches, this seems pretty much hopeless.
‘A specialised AI can speed up Infra-Bayesianism by the same amount random mathematicians can, by proving theorems and solving some math problems. A specialised AI can’t actually understand the goals of the research and contribute to the parts that require the hardest kind of human thinking.’ ‘I claim that if a system can output enough of that kind of thinking to meaningfully contribute, then it’s going to be smart enough to be dangerous.’ → what about MINERVA, GPT-4, LATS, etc.? Would you say that they’re specialized / dangerous / can’t ‘contribute to the parts that require the hardest kind of human thinking’? If the latter, what is the easiest benchmark/task displaying ‘complicated human thought’ that a non-dangerous LLM would have to pass/do for you to update?
‘I further claim that there’s a number of hours of complicated-human-thought such that making a safe system that can output work corresponding to that number in less than, e.g., 20 years, requires at least that number of hours of complicated human thought.’ → this seems ridiculously overconfident: there are many tasks for which verification is easier / takes less time than generation, e.g. (peer) reviewing alignment research; I’d encourage you to try to operationalize this claim for a prediction market.
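To make the generation/verification asymmetry concrete, here is a toy sketch (my own example, not from the discussion above; the instance and numbers are made up): finding a subset of numbers that sums to a target takes brute-force search, while checking a proposed answer is cheap.

```python
from itertools import combinations

def generate_solution(nums, target):
    """Brute-force search for a subset of nums summing to target: O(2^n) worst case."""
    for r in range(len(nums) + 1):
        for subset in combinations(nums, r):
            if sum(subset) == target:
                return list(subset)
    return None

def verify_solution(nums, target, candidate):
    """Check a proposed subset: cheap (polynomial) relative to the search above."""
    pool = list(nums)
    for x in candidate:
        if x not in pool:
            return False  # candidate uses an element not present in nums
        pool.remove(x)
    return sum(candidate) == target

nums = [3, 34, 4, 12, 5, 2]
found = generate_solution(nums, 9)              # the slow, "generation" step
print(found, verify_solution(nums, 9, found))   # the fast, "verification" step
```

The hope for alignment research is analogous: AI-generated proofs, experiments, or write-ups may be much cheaper to verify than to produce.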
‘Safely getting enough productivity out of these systems for it to matter is impossible IMO.’ → this is a very strong claim backed by no evidence; I’ll also note that ‘for it to matter’ should be a pretty low bar, given the (relatively) low amount of research work that has gone into (especially superintelligence) alignment and the low number of current FTEs.
‘If you think a system can solve specific problems, then please outline these problems (what are the hardest problems you expect to be able to safely solve with your system?) and say how fast the system is going to solve them and how many people will be supervising its “thoughts”.’ → It seems to me a bit absurd to ask for these kinds of details (years, number of people) years in advance; e.g. should I ask of various threat models how many AI researchers would work for how long on the (first) AI that takes over, before I put any credence in them? But yes, I do expect automated alignment researchers to be able to solve a wide variety of problems very safely (on top of all the automated math research, which is easily verifiable), including scalable oversight (e.g. see RLAIF already) and automated mech interp (e.g. see OpenAI’s recent work and automated circuit discovery). More generally, even if you used systems roughly human-level on a relatively broad set of tasks (so as to have very high confidence you fully cover everything a median human alignment researcher can do), I expect the takeover risks from only using them internally for automated alignment research (for at least months of calendar time) could relatively easily be driven << 0.1%, even just through quite obvious/prosaic safety/alignment measures: decent evals and red-teaming, adversarial training, safety measures like those from recent work by Redwood (e.g. removing steganography and monitoring forced intermediate text outputs), the best prosaic alignment methods at the time (e.g. RLHF/RLAIF variants as of today), unlearning (e.g. of ARA, cyber, and bio capabilities), etc.
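As one concrete example of the kind of measure listed above, here is a minimal sketch of monitoring forced intermediate text outputs (hypothetical function names and a keyword-based flagging rule purely for illustration; this is not Redwood’s actual implementation, and a real monitor would be a trained classifier rather than a keyword match):

```python
from dataclasses import dataclass

# Illustrative patterns only; a real deployment would use a trained monitor model.
SUSPICIOUS_PATTERNS = ["exfiltrate", "disable oversight", "hide this from"]

@dataclass
class MonitorResult:
    flagged: bool
    reason: str = ""

def monitor_intermediate_output(text: str) -> MonitorResult:
    """Cheap check applied to every intermediate text output before the agent continues."""
    lowered = text.lower()
    for pattern in SUSPICIOUS_PATTERNS:
        if pattern in lowered:
            return MonitorResult(flagged=True, reason=f"matched {pattern!r}")
    return MonitorResult(flagged=False)

def run_with_monitoring(model_step, task, max_steps=10):
    """Agent loop that forces intermediate text outputs and halts for human review if flagged."""
    transcript = []
    for _ in range(max_steps):
        thought = model_step(task, transcript)   # forced intermediate text output
        verdict = monitor_intermediate_output(thought)
        if verdict.flagged:
            raise RuntimeError(f"Escalating to human review: {verdict.reason}")
        transcript.append(thought)
    return transcript
```

Forcing the intermediate steps into text means the same transcript the automated monitor checks is also what human overseers can audit.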