Suppose the long-term risk center’s researchers, or a random group of teenage nerd hackers, or whatever, come up with what they call an “alignment solution”. A really complicated and esoteric, yet somehow elegant, way of describing what we really, really value, and cramming it into a big mess of virtual neurons. Suppose Eliezer and Tammy and Hanson and Wentworth and everyone else all go and look at the “alignment solution” very carefully for a very long time, and do not find any flaws in it. Lastly, suppose they test it on a weak AI and the AI immediately stops producing strange outputs/deceiving supervisors/specification gaming, and starts acting super nice and reasonable.
Great, right? Awesome, right? We won eternal eutopia, right? Our hard work finally paid off, right?
Even if this were to happen, I would still be skeptical about plugging our new, shiny solution into a superintelligence and hitting run. I believe that before we stumble on an alignment solution, we will stumble upon an “alignment solution”: something that looks like an alignment solution, but is flawed in some super subtle, complicated way that means that Earth still gets disassembled into compute or whatever, but the flaw is too subtle and complicated for even the brightest humans to spot. That for every true alignment solution, there are dozens of fake ones.
Is this something that I should be seriously concerned about?
Note that Eliezer and I and probably Tammy would all tell you that “can’t see how it fails” is not a very useful bar to aim for. At the bare minimum, we should have both (1) a compelling positive case that it in fact works, and (2) a compelling positive case that even if we missed something crucial, failure is likely to be recoverable rather than catastrophic.
That for every true alignment solution, there are dozens of fake ones.
Is this something that I should be seriously concerned about?
if you truly believe in a 1-to-dozens ratio between[1] real and ‘fake’ (endorsed by eliezer and others but unnoticedly flawed) solutions, then yes. in that case, you would naturally favor something like human intelligence augmentation, at least if you thought it had a chance of succeeding greater than p(chance of a solution being proposed which eliezer and others deem correct) × 1⁄24
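The comparison in this reply can be written out as a small calculation. This is only a sketch under stated assumptions: it reads "dozens" as 24 fakes per real solution (to match the 1⁄24 in the comment), and the function names and example probabilities are hypothetical, chosen for illustration rather than taken from the thread. Note that a strict 1-to-24 ratio means 1 in 25 approved solutions is real, which is close to, but not exactly, the 1⁄24 used above.

```python
def p_real_given_approved(fakes_per_real: float) -> float:
    """Fraction of reviewer-approved solutions that are real, given the ratio."""
    return 1.0 / (1.0 + fakes_per_real)


def alignment_path_success(p_approved: float, fakes_per_real: float) -> float:
    """P(an approved solution gets proposed) * P(it is real | approved)."""
    return p_approved * p_real_given_approved(fakes_per_real)


def prefer_augmentation(p_augmentation: float, p_approved: float,
                        fakes_per_real: float = 24.0) -> bool:
    """True if augmentation's success chance beats the alignment-path chance."""
    return p_augmentation > alignment_path_success(p_approved, fakes_per_real)


if __name__ == "__main__":
    # Hypothetical inputs, only to show the shape of the comparison.
    threshold = alignment_path_success(p_approved=0.5, fakes_per_real=24.0)
    print(f"threshold: {threshold:.3f}")
    print(prefer_augmentation(p_augmentation=0.05, p_approved=0.5))
```

The whole argument rests on `fakes_per_real`, which is exactly the number the reply goes on to ask the poster to justify.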
I believe that before we stumble on an alignment solution, we will stumble upon an “alignment solution”—something that looks like an alignment solution, but is flawed in some super subtle, complicated way that means that Earth still gets disassembled into compute or whatever, but the flaw is too subtle and complicated for even the brightest humans to spot
i suggest writing why you believe that. in particular, how do you estimate the prominence of ‘unnoticeably flawed’ alignment solutions given they are unnoticeable (to any human)?[2] where does “for every true alignment solution, there are dozens of fake ones” come from?
A really complicated and esoteric, yet somehow elegant
why does the proposed alignment solution have to be really complicated? overlooked mistakes become more likely as complexity increases, so this premise favors your conclusion.
[2] (there are ways you could in principle, for example if there were a pattern of the researchers continually noticing they made increasingly unintuitive errors up until the ‘final’ version (where they no longer notice, but presumably this pattern would be noticed); or extrapolation from general principles about {some class of programming that includes alignment} (?) being easy to make hard-to-notice unintuitive mistakes in)
Yes, human intelligence augmentation sounds like a good idea.
There are all sorts of “strategies” (turn it off, raise it like a kid, disincentivize changing the environment, use a weaker AI to align it) that people come up with when they’re new to the field of AI safety, but that are ineffective. And their ineffectiveness is only obvious and explainable by people who specifically know how AI behaves. Suppose there are strategies whose ineffectiveness is only obvious and explainable to people who know way more about decisions and agents and optimal strategies and stuff than humanity has figured out thus far. (Analogy: A society that only knows basic arithmetic could reasonably stumble upon and understand the Collatz conjecture; and yet, with all our mathematical development, we still can’t prove it. In the same way, we could reasonably stumble upon an “alignment solution” that we can’t show to be flawed, because doing so would take a much deeper understanding of these kinds of situations.)
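To make the Collatz analogy concrete: the rule needs nothing beyond halving, tripling, and adding one, which is why it is easy to state and to check by hand. The sketch below only tests individual starting values; the function name and the step cap are illustrative assumptions. Passing such checks for many numbers is not a proof, which mirrors the worry about a solution nobody can find a flaw in.

```python
from typing import Optional


def collatz_steps(n: int, max_steps: int = 10_000) -> Optional[int]:
    """Count steps for n to reach 1 under the Collatz rule.

    Returns None if the (arbitrary) step cap is hit first.
    """
    if n < 1:
        raise ValueError("n must be a positive integer")
    steps = 0
    while n != 1:
        if steps >= max_steps:
            return None
        # Halve even numbers; map odd numbers to 3n + 1.
        n = n // 2 if n % 2 == 0 else 3 * n + 1
        steps += 1
    return steps


if __name__ == "__main__":
    print(collatz_steps(6))  # 6 -> 3 -> 10 -> 5 -> 16 -> 8 -> 4 -> 2 -> 1, so 8
    # Every start up to 1000 reaches 1, yet this says nothing about all n.
    print(all(collatz_steps(n) is not None for n in range(1, 1001)))
```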
If the solution to alignment were simple, we would have found it by now. Humans are far from simple, human brains are far from simple, human behavior is far from simple. That there is one simple thing from which comes all of our values, or a simple way to derive such a thing, just seems unlikely.
There are all sorts of “strategies” (turn it off, raise it like a kid, disincentivize changing the environment, use a weaker AI to align it) that people come up with when they’re new to the field of AI safety, but that are ineffective. And their ineffectiveness is only obvious and explainable by people who specifically know how AI behaves.
yep but the first three all fail for the same reason: not believing (enough) that programs will do what they actually say to do, including in response to your actions. (the fourth one, ‘use a weaker AI to align it’, is at least that obviously not itself a solution. the weakest form of it, using an LLM to assist an alignment researcher, is possible, and some less weak forms likely are too.)
(there are some issues which a programmatic model doesn’t automatically make obvious to a human: they must follow from it, but one could fail to see them without making that basic mistake. probable environment hacking and decision theory issues come to mind. i agree that on general priors this is a bit of evidence that there are deeper subjects that would not be noticed even conditional on those researchers approving a solution.)
i guess my next response then would be that some subjects are bounded, and we might notice (if not ‘be able to prove’) such bounds telling us ‘there’s nothing more beyond what you have already written down’, which would be negative evidence. (this is more of an intuition, i don’t know how to elaborate this)
(also on what johnswentworth wrote. a similar point i was considering making is that you have set up the question such that you’re forced into playing a game similar to ‘show how you’d outsmart a superintelligence’ - for any consideration you can think of, one can respond that eliezer and others will probably also think of it, which might preclude them from actually approving, which makes your conditional ‘they approve but it’s wrong’ harder to be true and basically dependent on them instead of object-level properties of alignment.
you may want to set it up instead to be conditional on / about the likelihood of {they both don’t notice flaws and think it’s unlikely to succeed}, to (1) not require you to know something they don’t to make the question true, and (2) abstract away the world where they might approve despite thinking success is unlikely (out of no better choice).)
i am interested in reading more arguments about that object-level difficulty level if anyone has them.
If the solution to alignment were simple, we would have found it by now [...] That there is one simple thing from which comes all of our values, or a simple way to derive such a thing, just seems unlikely.
[Question] fake alignment solutions????
[1] (out of the ones which might be proposed, to avoid technicalities about infinite or implausible-to-be-thought-of proposals)
the pointer to values does not need to be complex (even if the values themselves are)