I believe there are people with far greater knowledge than me who can point out where I am wrong. I do believe my reasoning is wrong somewhere, but I cannot see why it would be highly unfeasible to train a sub-AGI-level AI that most likely will be aligned and able to solve AI alignment.
My assumptions are as follows:
1. Current AI seems aligned to the best of its ability.
2. PhD-level researchers would eventually solve AI alignment if given enough time.
3. PhD-level intelligence is below AGI in intelligence.
4. There is no clear reason why current AI, using current-paradigm technology, would become unaligned before reaching PhD-level intelligence.
5. We could train AI until it reaches PhD-level intelligence and then let it solve AI alignment, without it needing to self-improve.
The point I am least confident in is (4), since we have no clear way of knowing at what intelligence level an AI model would become unaligned.
Multiple organisations already seem to think that training an AI that solves alignment for us is the best path (e.g. OpenAI's Superalignment team).
Attached is my mental model of the intelligence that different tasks require and that different people have.
Figure 1: My mental model of natural research capability, RC (basically IQ, but correlating more strongly with research capability), in which the intelligence needed to align AI is above the average PhD level but below the smartest human in the world, and further still from AGI.
Not to derail on details, but what would it mean to solve alignment?
To me “solve” feels overly binary and final compared to the true challenge of alignment. Like, would solving alignment mean:
someone invents and implements a system that causes all AIs to do what their developer wants 100% of the time?
someone invents and implements a system that causes a single AI to do what its developer wants 100% of the time?
someone invents and implements a system that causes a single AI to do what its developer wants 100% of the time, and that AI and its descendants are always more powerful than other AIs for the rest of history?
ditto but 99.999%?
ditto but 99%?
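(Side note: those last two thresholds sound close but are wildly different in practice. A back-of-the-envelope sketch, where the daily request volume is a number I made up purely for illustration:)

```python
# Expected misaligned actions per day at each reliability level.
# The request volume is a hypothetical, illustrative figure.
requests_per_day = 1_000_000_000  # assumed fleet-wide total

for reliability in (0.99, 0.99999):
    expected_failures = requests_per_day * (1 - reliability)
    print(f"{reliability:.3%} aligned -> "
          f"~{expected_failures:,.0f} misaligned actions/day")

# 99.000% aligned -> ~10,000,000 misaligned actions/day
# 99.999% aligned -> ~10,000 misaligned actions/day
```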
And is there any distinction between an AI that is misaligned by mistake (e.g. it thinks I’ll want vanilla but really I want chocolate) vs knowingly misaligned (e.g. it gives me vanilla knowing I want chocolate so it can achieve its own ends)?
I’m really not sure which you mean, which makes it hard for me to engage with your question.
Human PhDs are generally intelligent. If you had an artificial intelligence that was generally intelligent, surely that would be an artificial general intelligence?
It might not be very clear, but as stated in the diagram, AGI is defined here as an AI capable of passing the Turing test as Alan Turing originally defined it.
An AGI would likely need to surpass, rather than merely equal, the intelligence of the judges it faces in the Turing test.
For example, if the AGI had an IQ/RC of 150, two judges with an IQ/RC of 160 should be able to determine more than 50% of the time whether they are speaking with a human or an AI.
Further, two judges of IQ/RC 150 could probably still guess which participant is the AI, since the AI has the additional difficulty, apart from being intelligent, of also simulating a human well enough to be indistinguishable to the judges.
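To make that intuition concrete, here is a toy model; the logistic shape, the slope, and the "imitation penalty" are all assumptions I invented for illustration, not measured quantities:

```python
import math

def p_detect(judge_rc: float, ai_rc: float,
             slope: float = 0.05, imitation_penalty: float = 10.0) -> float:
    """Toy model of one judge's chance of correctly identifying the AI.

    Assumptions (mine, purely illustrative): the AI pays an 'imitation
    penalty' because it must act human as well as be smart; the chance
    of a telltale slip rises logistically with the judge's effective
    capability advantage; absent a slip, the judge guesses 50/50.
    """
    effective_gap = judge_rc - (ai_rc - imitation_penalty)
    p_slip = 1 / (1 + math.exp(-slope * effective_gap))
    return 0.5 + 0.5 * p_slip  # guess baseline plus slip detection

for judge_rc, ai_rc in [(160, 150), (150, 150)]:
    one = p_detect(judge_rc, ai_rc)
    either = 1 - (1 - one) ** 2  # at least one of two independent judges
    print(f"judges RC {judge_rc} vs AI RC {ai_rc}: "
          f"one judge {one:.0%}, either of two {either:.0%}")
```

Under these made-up parameters the 160-RC judges detect the 150-RC AI well over half the time, and even equal-RC judges stay above chance because of the imitation burden.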
It seems extremely dubious that passing the Turing test is strongly linked to solving the alignment problem.
Agreed. Passing the Turing test requires intelligence equal to or greater than a human's in every single aspect, while the alignment problem may be solvable with only human intelligence.
What’s your model here, that as part of the Turing Test they ask the participant to solve the alignment problem and check whether the solution is correct? Isn’t this gonna totally fail due to 1) it taking too long, 2) not knowing how to robustly verify a solution, 3) some people/PhDs just randomly not being able to solve the alignment problem? And probably more.
So no, I don’t think passing a PhD-level Turing Test requires the ability to solve alignment.
If there exists a problem that a human can think of, that a human can solve, and that a human can verify, then an AI would also need to be able to solve that problem in order to pass the Turing test.
So if there exist some PhD-level intelligent people who can solve the alignment problem, and some who can verify a solution (which is likely easier), then an AI that cannot solve AI alignment would not pass the Turing test.
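To state the claim explicitly (my own formalization, with the test's time budget made explicit, since that seems to be where the disagreement above lives): let $\mathcal{P}_T$ be the set of problems some judge can pose and verify, and some human participant can solve, within the test's time budget $T$. Then

$$\text{the AI passes the test} \;\Longrightarrow\; \forall P \in \mathcal{P}_T,\ \text{the AI can solve } P \text{ within } T.$$

The contested step is whether "solve the alignment problem" belongs to $\mathcal{P}_T$ for any $T$ a real test could use.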
With that said, a simplified Turing test with shorter time limits and a smaller group of participants is much more feasible to conduct.
How do you verify a solution to the alignment problem? Or if you don’t have a verification method in mind, why assume it is easier than making a solution?
Great question.
I’d say that having a way to verify that a proposed solution to the alignment problem is actually a solution is itself part of solving the alignment problem.
But I understand this was not clear from my previous response.
A bit like with a mathematical problem: you’d be expected to show that your solution is correct, not merely guess that it might be.
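For intuition on why verifying can be far easier than solving (a general point about search problems, not a claim about alignment specifically), the classic example is an NP-style problem such as subset-sum, where checking a claimed answer is a single cheap pass but finding one is exponential in the worst case:

```python
from itertools import combinations

def verify(numbers: list[int], target: int, solution: tuple[int, ...]) -> bool:
    """Checking a claimed subset-sum solution takes one cheap pass."""
    # (multiset membership check elided for brevity)
    return sum(solution) == target and all(x in numbers for x in solution)

def solve(numbers: list[int], target: int):
    """Finding a solution is a search over 2^n subsets in the worst case."""
    for r in range(1, len(numbers) + 1):
        for subset in combinations(numbers, r):
            if sum(subset) == target:
                return subset
    return None

nums = [3, 34, 4, 12, 5, 2]
sol = solve(nums, 9)              # expensive: brute-force search
print(sol, verify(nums, 9, sol))  # cheap check: prints (4, 5) True
```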
Points (1) and (4) seem the weakest here, and the rest not very relevant.
There are hundreds of examples already published, and even in mainstream public circulation, where current AI does not behave in human interests to the best of its ability. Mostly, though, these systems don’t even do anything relevant to alignment, and much of what they say on matters of human values is actually pretty terrible. This is despite the best efforts of human researchers who are, for the present, far in advance of AI capabilities.
Even if (1) were true, by the time you get to the sort of planning capability that humans require to carry out long-term research tasks, you also get much improved capabilities for misalignment. It’s almost cute when a current toy AI does things that appear misaligned. It would not be at all cute if an RC 150 (on your scale) AI had the same degree of misalignment “on the inside” but were capable of appearing aligned while it sought recursive self-improvement or other paths that could lead to disaster.
Furthermore, there are surprisingly many humans who are actively trying to make misaligned AI, or who at best show reckless disregard for whether their AIs are aligned. Even if all of these points were true, then yes, perhaps we could eventually train an AI to solve alignment. But will that be good enough to catch every AI that may become capable of recursive self-improvement or other dangerous capabilities before alignment is solved, or that is deployed without that solution applied?
Really? I’m not aware of any examples of this.
One fairly famous example is the claim that it is better to allow millions of people to be killed by a terrorist nuke than to disarm it by saying a password that is a racial slur.
Obviously any current system is too incoherent and powerless to do anything about acting on such a moral principle, so it’s just something we can laugh at and move on. A capable system that enshrined that sort of moral ordering in a more powerful version of itself would quite predictably lead to catastrophe as soon as it observed actual human behaviour.
It’s always hard to say whether this is an alignment problem or a capabilities problem. It’s also too contrived to offer much signal.
The overall vibe is that these LLMs grasp most of our values pretty well. They give common-sense answers to most moral questions. You can see them grasp Chinese values pretty well too, so n=2. It’s hard to characterize this as mostly “terrible”.
This shouldn’t be too surprising in retrospect: our values are simple for LLMs to learn. An LLM is not going to disassemble cows for atoms in order to end racism. There are edge cases where it’s too woke, but those got quickly fixed, and I don’t expect them to ever pop up again.