1. Current AI seems to be aligned to the best of its ability.
2. PhD-level researchers would eventually solve AI alignment if given enough time.
3. PhD-level intelligence is below AGI-level intelligence.
4. There is no clear reason why current AI, using current-paradigm technology, would become unaligned before reaching PhD-level intelligence.
5. We could train AI until it reaches PhD-level intelligence and then let it solve AI alignment, without the AI needing to self-improve.
Points (1) and (4) seem the weakest here, and the rest not very relevant.
There are hundreds of examples, already published and even in mainstream public circulation, where current AI does not behave in human interests to the best of its ability. Mostly, though, these systems don’t even do anything relevant to alignment, and much of what they say on matters of human values is actually pretty terrible. This is despite the best efforts of human researchers, who are, for the present, far in advance of AI capabilities.
Even if (1) were true, by the time you get to the sort of planning capability that humans require to carry out long-term research tasks, you also get much improved capabilities for misalignment. It’s almost cute when a current toy AI does things that appear misaligned. It would not be at all cute if an RC 150 (on your scale) AI had the same degree of misalignment “on the inside” but was capable of appearing aligned while it sought recursive self-improvement or other paths that could lead to disaster.
Furthermore, there are surprisingly many humans who are actively trying to make misaligned AI, or who at best act with reckless disregard for whether their AIs are aligned. Even if all of these points were true, then yes, perhaps we could train an AI to solve alignment eventually. But would that be good enough to catch every AI capable of recursive self-improvement or other dangerous capabilities that appears before alignment is solved, or that is deployed without applying that solution?
Really? I’m not aware of any examples of this.
One fairly famous example is a model claiming that it is better to allow millions of people to be killed by a terrorist nuke than to disarm it by saying a password that is a racial slur.
Obviously any current system is too incoherent and powerless to act on such a moral principle, so it’s just something we can laugh at and move on from. A capable system that enshrined that sort of moral ordering in a more powerful version of itself would quite predictably lead to catastrophe as soon as it observed actual human behaviour.
It’s always hard to say whether this is an alignment problem or a capabilities problem. It’s also too contrived to offer much signal.
The overall vibe is that these LLMs grasp most of our values pretty well. They give common-sense answers to most moral questions. You can see them grasp Chinese values pretty well too, so n=2. It’s hard to characterize this as mostly “terrible”.
This shouldn’t be too surprising in retrospect. Our values are simple for LLMs to learn. An LLM is not going to disassemble cows for atoms to end racism. There are edge cases where it’s too woke, but these were quickly fixed, and I don’t expect them to ever pop up again.