Navigate the initial alignment problem: getting to the first point of having very powerful (human-level-ish), yet safe, AI systems. For human-level-ish AIs, I think it’s plausible that the alignment problem is easy, trivial, or nonexistent. It’s also plausible that it’s fiendishly hard.
Can you clarify what you mean by human-level-ish and safe? These terms seem almost contradictory to me—human-ish cognition is extremely unsafe in literal humans, and not just because it could be misused or directed towards dangerous or destructive ends by other humans.
“Transformative and safe” (the phrase you use in Nearcast-based “deployment problem” analysis) seems less contradictory. I can imagine AI systems and other technologies that are transformative (e.g. biotech, nanotech, AI that is far below human-level at general reasoning but superhuman in specific domains), and still safe or mostly safe when not deliberately misused by bad actors.
Dangerousness of human-level cognition has nothing to do with how hard or easy alignment of artificial systems is: a literal human, trapped in an AI lab, can probably escape or convince the lab to give them “parole” (meaning anything more than fully-controlled and fully-monitored-in-realtime access to the internet). Literal mind-reading might be sufficient to contain most or all humans, but I don’t think interpretability tools currently provide anything close to “mind-reading” for AI systems, and other security precautions that AI labs currently take also seem insufficient for containing literal humans under even mildly adversarial conditions.
Maybe alignment turns out to be easy, and/or it turns out to be trivial to make the AI not want to escape or get parole. But alignment on that level is definitely not a solved problem in humans, so the aim for human-level AI has always seemed kind of strange to me.
To be clear, “it turns out to be trivial to make the AI not want to escape” is a big part of my model of how this might work. The basic thinking is that for a human-level-ish system, consistently reinforcing (via gradient descent) intended behavior might be good enough, because alternative generalizations like “Behave as intended unless there are opportunities to get lots of resources, undetected or unchallenged” might not have many or any “use cases.”
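To make the “might not have many or any use cases” intuition concrete, here is a deliberately toy sketch (every feature and number here is invented for illustration, not a claim about real training runs): a logistic policy is trained with REINFORCE only in situations that supervisors actually observe, and because the observed and unobserved situations share a feature, the reinforced “comply” behavior generalizes to the situation that never appeared in training.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two situations sharing a feature: one appears in training ("observed"),
# one never does ("unobserved", standing in for an undetected opportunity
# to grab resources). Both are invented for this toy example.
phi = {
    "observed": np.array([1.0, 1.0]),
    "unobserved": np.array([1.0, 0.0]),  # shares only the first feature
}
w = np.zeros(2)  # logistic-policy weights: P(comply) = sigmoid(w @ phi)

def p_comply(state):
    return 1.0 / (1.0 + np.exp(-w @ phi[state]))

before = p_comply("unobserved")  # 0.5: the untrained policy has no opinion

# REINFORCE: reward +1 for complying, -1 for defecting, but only ever in
# the "observed" state -- the alternative generalization "defect when
# unobserved" never gets a training signal of its own.
lr = 0.1
for _ in range(500):
    p = p_comply("observed")
    comply = rng.random() < p
    reward = 1.0 if comply else -1.0
    grad = (float(comply) - p) * phi["observed"]  # d log pi(a|s) / dw
    w += lr * reward * grad

after = p_comply("unobserved")
print(before, after)  # compliance generalizes via the shared feature
```

The point of the sketch is only that, when the intended behavior is the simplest fit to the training signal, it can carry over to situations that training never exercised; whether real systems share features in this convenient way is exactly the open question.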
A number of other measures, including AI checks and balances, also seem like they might work pretty easily for human-level-ish systems, which could have a lot of trouble doing things like coordinating reliably with each other.
So the idea isn’t that human-level-ish capabilities are inherently safe, but that straightforward attempts at catching/checking/blocking/disincentivizing unintended behavior could be quite effective for such systems (while such things might be less effective on systems that are extraordinarily capable relative to supervisors).
I see, thanks for clarifying. I agree that it might be straightforward to catch bad behavior (e.g. deception), but I expect that RL methods will work by training away the ability of the system to deceive, rather than the desire.[1] So even if such training succeeds, in the sense that the system robustly behaves honestly, it will also no longer be human-level-ish, since humans are capable of being deceptive.
Maybe it is possible to create an AI system that is like the humans in the movie The Invention of Lying, but that seems difficult and fragile. In the movie, one guy discovers he can lie, and suddenly he can run roughshod over his entire civilization. The humans in the movie initially have no ability to lie, but once the main character discovers it, he immediately realizes its usefulness. The only thing that keeps other people from making the same realization is the fictional conceit of the movie.
Or, paraphrasing Nate: the ability to deceive is a consequence of understanding how the world works on a sufficiently deep level, so it’s probably not something that can be trained away by RL, without also training away the ability to generalize at human levels entirely.
OTOH, if you could somehow imbue an innate desire to be honest into the system without affecting its capabilities, that might be more promising. But again, I don’t think that’s what SGD or current RL methods are actually doing. (Though it is hard to be sure, in part because no current AI systems appear to exhibit desires or inner motivations of any kind. I think attempts to analogize the workings of such systems to desires in humans and components in the brain are mostly spurious pattern-matching, but that’s a different topic.)
In the words of Alex Turner, in RL, “reward chisels cognitive grooves into an agent”. Rewarding non-deceptive behavior could thus chisel away the cognition capable of performing the deception, but that cognition might be what makes the system human-level in the first place.
Hm, it seems to me that RL would be more like training away the desire to deceive, although I’m not sure either “ability” or “desire” is totally on target—I think something like “habit” or “policy” captures it better. The training might not be bulletproof (AI systems might have multiple goals and sometimes notice that deception would help accomplish them), but one doesn’t need 100% elimination of deception anyway, especially not when combined with effective checks and balances.
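The “habit/policy” reading can also be put in toy form (again a stylized sketch with invented components, not a model of real training): if the “world model” is a frozen feature map and RL only reshapes an action head on top of it, then penalizing deception drives the policy to stop selecting it, while the representation that makes deception possible is left untouched.

```python
import numpy as np

rng = np.random.default_rng(1)

# Frozen "world model": a fixed random feature map standing in for the
# knowledge that makes deception possible in the first place.
n_states, n_feat = 5, 8
features = rng.normal(size=(n_states, n_feat))  # never updated by RL

# Trainable action head over two actions: 0 = honest, 1 = deceive.
head = np.zeros((n_feat, 2))

def policy(s):
    logits = features[s] @ head
    e = np.exp(logits - logits.max())
    return e / e.sum()

# REINFORCE with reward -1 whenever "deceive" is sampled, 0 otherwise.
lr = 0.5
for _ in range(2000):
    s = rng.integers(n_states)
    p = policy(s)
    a = rng.choice(2, p=p)
    reward = -1.0 if a == 1 else 0.0
    # d log pi(a|s) / d head = outer(features[s], onehot(a) - pi(.|s))
    head += lr * reward * np.outer(features[s], np.eye(2)[a] - p)

p_deceive = np.mean([policy(s)[1] for s in range(n_states)])
print(p_deceive)  # driven toward zero, while `features` is never touched
```

By construction this toy only ever edits the policy head, so it illustrates the distinction rather than settles it; whether gradient descent in real systems localizes the change to something policy-like, or eats into shared capability, is the actual disagreement.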
I notice I don’t have strong opinions on what effects RL will have in this context: whether it will change just surface-level specific capabilities, whether it will shift the desires/motivations behind the behavior, whether it’s better to think about these systems as having habits or shards (note I don’t actually understand shard theory that well and this may be a mischaracterization) that RL shifts, or something else. This just seems very unclear to me right now.
Do either of you have particular evidence that informs your views on this that I can update on? Maybe specifically I’m interested in knowing: assuming we are training with RL based on human feedback on diverse tasks and doing currently known safety things like adversarial training, where does this process actually push the model: toward rule following, toward lying in wait to overthrow humanity, toward valuing its creators, etc.? I currently would not be surprised if it led to “playing the training game” and lying in wait, and I would be slightly but not very surprised if it led to some safe heuristics like following rules and not harming humans. I mostly have intuition behind these beliefs.