Found this to be an interesting list of challenges, but I disagree with a few points. (Not trying to be comprehensive here, just a few thoughts after the first read-through.)
Several of the points here are premised on needing to do a pivotal act that is way out of distribution from anything the agent has been trained on. But it’s much safer to deploy AI iteratively; increasing the stakes, time horizons, and autonomy a little bit each time. With this iterative approach to deployment, you only need to generalize a little bit out of distribution. Further, you can use Agent N to help you closely supervise Agent N+1 before giving it any power.
One claim is that "capabilities generalize further than alignment once capabilities start to generalize far." The argument is that an agent's world model and tactics will be automatically fixed by reasoning and data, but its inner objective won't be changed by these things. I agree with the preceding sentence, but I would draw a different (and more optimistic) conclusion from it: that it might be possible to establish an agent's inner objective when training on easy problems, when the agent isn't very capable, such that this objective remains stable as the agent becomes more powerful.

Also, there's empirical evidence that alignment generalizes surprisingly well: several thousand instruction-following examples radically improve aligned behavior on a wide distribution of language tasks (InstructGPT paper), and a prompt with about 20 conversations gives much better behavior on a wide variety of conversational inputs (HHH paper). Making a contemporary language model well-behaved seems to be much easier than teaching it a new cognitive skill.
Human raters make systematic errors—regular, compactly describable, predictable errors…

This is indeed one of the big problems of outer alignment, but there's lots of ongoing research and promising ideas for fixing it. Namely, using models to help amplify and improve the human feedback signal. Because P ≠ NP, it's easier to verify proofs than to write them. Obviously alignment isn't about writing proofs, but the general principle does apply. You can reduce "behaving well" to "answering questions truthfully" by asking questions like "did the agent follow the instructions in this episode?", and use those answers to define the reward function. These questions are not formulated in a formal language where verification is easy, but there's reason to believe that verification is also easier than generation for informal arguments.
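A minimal sketch of that reduction, to make it concrete; `ask_model` is a hypothetical placeholder for whatever trusted question-answering model you'd actually use, not an API from either paper:

```python
def ask_model(question: str) -> float:
    """Hypothetical stand-in for a trusted question-answering model.

    Returns an estimated probability that the honest answer is 'yes'.
    Plug in whatever verifier/critique model you actually trust.
    """
    raise NotImplementedError

def episode_reward(instructions: str, transcript: str) -> float:
    """Reduce 'behaving well' to question answering: score an episode by
    asking whether the agent followed its instructions."""
    question = (
        f"Instructions given to the agent:\n{instructions}\n\n"
        f"Transcript of the episode:\n{transcript}\n\n"
        "Did the agent follow the instructions in this episode?"
    )
    return ask_model(question)  # use this (or an average over several questions) as the reward
```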
But it’s much safer to deploy AI iteratively; increasing the stakes, time horizons, and autonomy a little bit each time. With this iterative approach to deployment, you only need to generalize a little bit out of distribution. Further, you can use Agent N to help you closely supervise Agent N+1 before giving it any power.
My model of Eliezer claims that there are some capabilities that are ‘smooth’, like “how large a times table you’ve memorized”, and some are ‘lumpy’, like “whether or not you see the axioms behind arithmetic.” While it seems plausible that we can iteratively increase smooth capabilities, it seems much less plausible for lumpy capabilities.
A specific example: if you have a neural network with enough capacity to 1) memorize specific multiplication Q+As and 2) implement a multiplication calculator, my guess is that during training you’ll see a discontinuity in how many pairs of numbers it can successfully multiply.[1] It is not obvious to me whether or not there are relevant capabilities like this that we’ll “find with neural nets” instead of “explicitly programming in”; probably we will just build AlphaZero so that it uses MCTS instead of finding MCTS with gradient descent, for example.
[edit: actually, I also don't think I get how you'd use a 'smaller times table' to oversee a 'bigger times table' unless you already knew how arithmetic worked, at which point it's not obvious why you're not just writing an arithmetic program.]
That it might be possible to establish an agent’s inner objective when training on easy problems, when the agent isn’t very capable, such that this objective remains stable as the agent becomes more powerful.
IMO this runs into two large classes of problems, both of which I put under the heading ‘ontological collapse’.
First, suppose the agent’s inner objective is internally located: “seek out pleasant tastes.” Then you run into 16 and 17, where you can’t quite be sure what it means by “pleasant tastes”, and you don’t have a great sense of what “pleasant tastes” will extrapolate to at the next level of capabilities. [One running “joke” in EA is that, on some theories of what morality is about, the highest-value universe is one which contains an extremely large number of rat brains on heroin. I think this is the correct extrapolation / maximization of at least one theory which produces good behavior when implemented by humans today, which makes me pretty worried about this sort of extrapolation.]
Second, suppose the agent's inner objective is externally located: "seek out mom pressing the reward button". Then you run into 18, which argues that once the agent realizes that the 'reward button' is an object in its environment instead of a communication channel between the human and itself, it may optimize for the object instead of 'being able to hear what the human would freely communicate' or whatever philosophically complicated variable it is that we care about. [Note that attempts to express this often need multiple patches and still aren't fixed; "mom approves of you" can be coerced, and "mom would freely approve of you" runs into trouble because you have some freedom in identifying your concept of 'mom', which means you might pick one who happens to approve of you.]
there’s lots of ongoing research and promising ideas for fixing it.
I'm optimistic about this too, but… I want to make sure we're looking at the same problem, or something? I think my sense is best expressed in Stanovich and West, where they talk about four responses to the presence of systematic human misjudgments. The 'performance error' response is basically the 'epsilon-rationality' assumption: 1-ε of the time humans make the right call, and ε of the time they make a random call. While that's a fine model of performance errors, it doesn't accurately describe what's happening with systematic errors, which are predictable rather than stochastic.
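A toy contrast between the two error models (all numbers invented for illustration): stochastic performance errors wash out when you aggregate many ratings, while a systematic error is made by every rater in the same direction, so no amount of aggregation fixes it:

```python
import random

# A rater compares two options and should prefer the truly better one (option 0).

def epsilon_rater(eps: float) -> int:
    """Performance-error model: right call with prob 1 - eps, coin flip otherwise."""
    if random.random() < eps:
        return random.choice([0, 1])
    return 0  # the right call

def biased_rater(flashy_bonus: float) -> int:
    """Systematic-error model: predictably overrates the 'flashier' option 1."""
    true_values = [1.0, 0.8]  # option 0 really is better
    perceived = [true_values[0], true_values[1] + flashy_bonus]
    return perceived.index(max(perceived))

def fraction_correct(rater, n=10_000, **kw):
    return sum(rater(**kw) == 0 for _ in range(n)) / n

print(fraction_correct(epsilon_rater, eps=0.3))        # ~0.85: the noise averages out
print(fraction_correct(biased_rater, flashy_bonus=0.5))  # 0.0: every rater makes the same mistake
```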
I sometimes see people suggest that the model should always or never conform to the human’s systematic errors, but it seems to me like we need to somehow distinguish between systematic “errors” that are ‘value judgments’ (“oh, it’s not that the human prefers 5 deaths to 1 death, it’s that they are opposed to this ‘murder’ thing that I should figure out”) and systematic errors that are ‘bounded rationality’ or ‘developmental levels’ (“oh, it’s not that the (very young) human prefers less water to more water, it’s that they haven’t figured out conservation of mass yet”). It seems pretty sad if we embed all of our confusions into the AI forever—and also pretty sad if we end up not able to transfer any values because all of them look like confusions.[2]
[1] This might depend on what sort of curriculum you train it on; I was imagining something like: 1) set the number of digits N=1, 2) generate two numbers uniformly at random between 1 and 2^N and pass them as inputs (sequences of digits?), 3) compare the outputted sequence of digits to the correct answer, either with a binary pass/fail or some sort of continuous similarity metric (so it gets some points for 12x12 = 140 or whatever); once it performs at 90% success, check the performance at increased N until you find one with below 80% success, and continue training there. In that scenario, I think it just memorizes until N is moderately sized (8?), at which point it figures out how to multiply, and then you can increase N a lot without losing accuracy (until you hit some overflow error in its implementation of multiplication from having large numbers).
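For concreteness, a rough sketch of that curriculum; `train_step` and `accuracy` are placeholders for whatever model, training loop, and evaluation you'd actually plug in, and only the curriculum logic (when to grow N) is spelled out:

```python
import random

def sample_problem(n_digits: int):
    """Two random factors in [1, 2^N], as digit strings, plus the target product."""
    a = random.randint(1, 2 ** n_digits)
    b = random.randint(1, 2 ** n_digits)
    return (str(a), str(b)), str(a * b)

def run_curriculum(train_step, accuracy, max_digits=20):
    n = 1
    while n <= max_digits:
        # Train at the current difficulty until the model clears 90% success.
        while accuracy(n) < 0.90:
            train_step(sample_problem(n))
        # Probe harder problems until performance drops below 80%,
        # then resume training at that difficulty.
        while n <= max_digits and accuracy(n) >= 0.80:
            n += 1
```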
[2] I’m being a little unfair in using the trolley problem as an example of a value judgment, because in my mind the people who think you shouldn’t pull the lever because it’s murder are confused or missing a developmental jump—but I have the sense that for most value judgments we could find, we can find some coherent position which views it as confused in this way.
Re: smooth vs bumpy capabilities, I agree that capabilities sometimes emerge abruptly and unexpectedly. Still, iterative deployment with gradually increasing stakes is much safer than deploying a model to do something totally unprecedented and high-stakes. There are multiple ways to make deployment more conservative and gradual. (E.g., incrementally increase the amount of work the AI is allowed to do without close supervision, incrementally increase the amount of KL-divergence between the new policy and a known-to-be-safe policy.)
Re: ontological collapse, there are definitely some tricky issues here, but the problem might not be so bad with the current paradigm, where you start with a pretrained model (which doesn’t really have goals and isn’t good at long-horizon control), and fine-tune it (which makes it better at goal-directed behavior). In this case, most of the concepts are learned during the pretraining phase, not the fine-tuning phase where it learns goal-directed behavior.
Still, iterative deployment with gradually increasing stakes is much safer than deploying a model to do something totally unprecedented and high-stakes.
I agree with the “X is safer than Y” claim; I am uncertain whether it’s practically available to us, and much more worried in worlds where it isn’t available.
incrementally increase the amount of KL-divergence between the new policy and a known-to-be-safe policy
For this specific proposal, when I reframe it as “give the system a KL-divergence budget to spend on each change to its policy” I worry that it works against a stochastic attacker but not an optimizing attacker; it may be the case that every known-to-be-safe policy has some unsafe policy within a reasonable KL-divergence of it, because the danger can be localized in changes to some small part of the overall policy-space.
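A toy numeric version of that worry (numbers made up): the average KL against the safe policy stays tiny even though the new policy is completely flipped on one rarely visited 'trigger' state.

```python
import math

def kl(p, q):
    """KL divergence between two discrete action distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

states = ["routine"] * 999 + ["trigger"]          # the trigger state is 1-in-1000
safe_policy = {"routine": [0.99, 0.01], "trigger": [0.99, 0.01]}
new_policy  = {"routine": [0.99, 0.01], "trigger": [0.01, 0.99]}  # flipped here only

avg_kl = sum(kl(new_policy[s], safe_policy[s]) for s in states) / len(states)
print(avg_kl)  # ~0.0045 nats: well under any 'reasonable' budget,
               # even though the behavior in the trigger state is now the opposite.
```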
the problem might not be so bad with the current paradigm, where you start with a pretrained model (which doesn’t really have goals and isn’t good at long-horizon control), and fine-tune it (which makes it better at goal-directed behavior). In this case, most of the concepts are learned during the pretraining phase, not the fine-tuning phase where it learns goal-directed behavior.
Yeah, I agree that this seems pretty good. I do naively guess that when you do the fine-tuning, it's the concepts most related to the goals that change the most (as they have the most gradient pressure on them); it'd be nice to know how much this is the case, vs. most of the relevant concepts being durable parts of the environment that were already very important for goal-free prediction.
Several of the points here are premised on needing to do a pivotal act that is way out of distribution from anything the agent has been trained on. But it’s much safer to deploy AI iteratively; increasing the stakes, time horizons, and autonomy a little bit each time.
To do what, exactly, in this nice iterated fashion, before Facebook AI Research destroys the world six months later? What is the weak pivotal act that you can perform so safely?
Human raters make systematic errors—regular, compactly describable, predictable errors… This is indeed one of the big problems of outer alignment, but there's lots of ongoing research and promising ideas for fixing it. Namely, using models to help amplify and improve the human feedback signal. Because P ≠ NP, it's easier to verify proofs than to write them.
When the rater is flawed, cranking up the power to NP levels blows up the P part of the system.
To do what, exactly, in this nice iterated fashion, before Facebook AI Research destroys the world six months later? What is the weak pivotal act that you can perform so safely?
Do alignment & safety research, set up regulatory bodies and monitoring systems.
When the rater is flawed, cranking up the power to NP levels blows up the P part of the system.
Not sure exactly what this means. I'm claiming that you can make raters less flawed, for example by decomposing the rating task and providing model-generated critiques that help with their rating. Also, as models get more sample-efficient, you can rely more on highly skilled and vetted raters.
My read was that for systems where you have rock-solid checking steps, you can throw arbitrary amounts of compute at searching for things that check out and trust them, but if there’s any crack in the checking steps, then things that ‘check out’ aren’t trustable, because the proposer can have searched an unimaginably large space (from the rater’s perspective) to find them. [And from the proposer’s perspective, the checking steps are the real spec, not whatever’s in your head.]
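A toy version of that dynamic (all numbers invented): the checker mostly tracks true quality but has one small exploitable 'crack', and a large enough search almost always returns something that exploits the crack rather than the genuinely best candidate.

```python
import random

random.seed(0)

def candidate():
    true_quality = random.gauss(0, 1)
    exploits_crack = random.random() < 0.01          # rare but searchable feature
    checker_score = true_quality + (5.0 if exploits_crack else 0.0)
    return true_quality, checker_score

pool = [candidate() for _ in range(100_000)]         # the proposer's huge search
approved = max(pool, key=lambda c: c[1])             # what passes the checker
honest_best = max(pool, key=lambda c: c[0])          # what we actually wanted
print(approved, honest_best)
# The approved candidate almost always exploits the crack: most of its checker
# score comes from the bonus, and the genuinely best candidate loses the search.
```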
In general, I think we can get a minor edge from “checking AI work” instead of “generating our own work” and that doesn’t seem like enough to tackle ‘cognitive megaprojects’ (like ‘cure cancer’ or ‘develop a pathway from our current society to one that can reliably handle x-risk’ or so on). Like, I’m optimistic about “current human scientists use software assistance to attempt to cure cancer” and “an artificial scientist attempts to cure cancer” and pretty pessimistic about “current human scientists attempt to check the work of an artificial scientist that is attempting to cure cancer.” It reminds me of translators who complained pretty bitterly about being given machine-translated work to ‘correct’; they basically still had to do it all over again themselves in order to determine whether or not the machine had gotten it right, and so it wasn’t nearly as much of a savings as hoped.
Like the value of ‘DocBot attempts to cure cancer’ is that DocBot can think larger and wider thoughts than humans, and natively manipulate an opaque-to-us dense causal graph of the biochemical pathways in the human body, and so on; if you insist on DocBot only thinking legible-to-human thoughts, then it’s not obvious it will significantly outperform humans.
To do what, exactly, in this nice iterated fashion, before Facebook AI Research destroys the world six months later? What is the weak pivotal act that you can perform so safely?
Produce the Textbook From The Future that tells us how to do AGI safely. That said, getting an AGI to generate a correct Foom safety textbook or AGI textbook from the future would be incredibly difficult: it would be very possible for an AGI to slip in a subtle, hard-to-detect inaccuracy that would make it worthless; verifying that it is correct would be very difficult; and getting all humans on earth to follow it would be very difficult.
If Facebook AI Research is such a threat, wouldn't it be possible to talk to Yann LeCun?
I did, briefly. I ask that you not do so yourself, or anybody else outside one of the major existing organizations, because I expect that will make things worse as you annoy him and fail to phrase your arguments in any way he’d find helpful.
Other MIRI staff have also chatted with Yann. One co-worker told me that he was impressed with Yann’s clarity of thought on related topics (e.g., he has some sensible, detailed, reductionist models of AI), so I’m surprised things haven’t gone better.
Non-MIRI folks have talked to Yann too; e.g., Debate on Instrumental Convergence between LeCun, Russell, Bengio, Zador, and More.
What happened?
Nothing much.
There was also a debate between Yann and Stuart Russell on Facebook, which got discussed here:
https://www.lesswrong.com/posts/WxW6Gc6f2z3mzmqKs/debate-on-instrumental-convergence-between-lecun-russell
For a more comprehensive writeup of some stuff related to "annoy him and fail to phrase your arguments helpfully", see Idea Inoculation and Inferential Distance.
My view is that if Yann continues to be interested in arguing about the issue, then there's something to work with, even if he's skeptical. The real worry is if he's stopped talking to anyone about it. (I have no idea personally what his state of mind is right now.)