I think that progress in language modeling makes this view look much worse than it did in 2018.
It doesn’t look much worse to me yet. (I’m not sure whether you know things I don’t, or whether we’re reading the situation differently. We could maybe try to bang out specific bets here at some point.)
Yet it seems like GPT-3 already has a strong enough understanding of what humans care about that it could be used for this purpose.
For the record, there’s a breed of reasoning-about-the-consequences-humans-care-about that I think GPT-3 relevantly can’t do (related to how GPT-3 is not in fact scary), and the shallower analog it can do does not seem to me to undermine what-seems-to-me-to-be-the-point in the quoted text.
I acknowledge this might be frustrating to people who think that these come on an obvious continuum that GPT-3 is obviously walking along. This looks to me like one of those “can you ask me in advance first” moments where I’m happy to tell you (in advance of seeing what GPT-N can do) what sorts of predicting-which-consequences-humans-care-about I would deem “shallow and not much evidence” vs “either evidence that this AI is scary or actively in violation of my model”.
I feel like you’ve got to admit that we’re currently in a world where everyone is building non-self-modifying Oracles that can explain the consequences of their plans
I don’t in fact think that the current levels of “explaining the consequences of their plans” are either impressive in the relevant way, or going to generalize in the relevant way. I do predict that things are going to have to change before the end-game. In response to these observations, my own models are saying “sure, this is the sort of thing that can happen before the end (although obviously some stuff is going to have to change, and it’s no coincidence that the current systems aren’t themselves particularly scary)”, because predicting the future is hard and my models don’t concentrate probability mass all that tightly on the details. It’s plausible to me that I’m supposed to be conceding a bunch of Bayes points to people who think this all falls on a continuum that we’re clearly walking along, but I admit I have some sense that people just point, in a shallower way, to what actually happened and say “see, that’s what my model predicted” rather than actually calling particulars in advance. (I can recall a specific case of Dario predicting some particulars in advance, and I concede Bayes points there. I also have the impression that you put more probability mass here than I did, although fewer specific examples spring to mind, and I concede some Bayes points to you as well, though fewer.) I consider it to be some evidence, but not enough to shift me much. Reflecting on why, I think it’s on account of how my models haven’t taken hits that are bigger than they expected to take (on account of all the vagaries), and how I still don’t know how to make sense of the rest of the world through my-understanding-of your (or Dario’s) lens.
It doesn’t look much worse to me yet. (I’m not sure whether you know things I don’t, or whether we’re reading the situation differently. We could maybe try to bang out specific bets here at some point.)
Which of “being smart,” “being a good person,” and “still being a good person in a Chinese bureaucracy” do you think is hard (prior to having AI smart enough to be dangerous)? Does that correspond to some prediction about the kind of imitation task that will prove difficult for AI?
For the record, there’s a breed of reasoning-about-the-consequences-humans-care-about that I think GPT-3 relevantly can’t do (related to how GPT-3 is not in fact scary), and the shallower analog it can do does not seem to me to undermine what-seems-to-me-to-be-the-point in the quoted text.
Eliezer gave an example about identifying which of two changes we care about (“destroying her music collection” vs. “changes to its own files”). That kind of example does not seem to involve deep reasoning about consequences-humans-care-about. Eliezer may be using this example in a more deeply allegorical way, but it seems like in this case the allegory has thrown out the important part of the example and I’m not even sure how to turn it into an example that he would stand behind.
I acknowledge this might be frustrating to people who think that these come on an obvious continuum that GPT-3 is obviously walking along. This looks to me like one of those “can you ask me in advance first” moments where I’m happy to tell you (in advance of seeing what GPT-N can do) what sorts of predicting-which-consequences-humans-care-about I would deem “shallow and not much evidence” vs “either evidence that this AI is scary or actively in violation of my model”.
You and Eliezer often suggest that particular alignment strategies are doomed because they involve AI solving hard tasks that won’t be doable until it’s too late (as in the quoted comment by Eliezer). I think if you want people to engage with those objections seriously, you should probably say more about what kinds of tasks you have in mind.
My current sense is that nothing is in violation of your model until the end of days. In that case it’s fair enough to say that we shouldn’t update about your model based on evidence. But that also means I’m just not going to find the objection persuasive unless I see more of an argument, or else some way of grounding out the objection in intuitions that do make some different prediction about something we actually observe (either in the interim or historically).
I don’t in fact think that the current levels of “explaining the consequences of their plans” are either impressive in the relevant way, or going to generalize in the relevant way.
I think language models can explain the consequences of their plans insofar as they understand those consequences at all. It seems reasonable for you to say “language models aren’t like the kind of AI systems we are worried about,” but I feel like in that case each unit of progress in language modeling needs to be evidence against your view.
You are predicting that powerful AI will have property X (= it can make plans with consequences that it can’t explain). If existing AIs had property X, then that would be evidence for your view. If existing AIs mostly don’t have property X, that must be evidence against your view. The only way it’s a small amount of evidence is if you were quite confident that existing AIs wouldn’t have property X.
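Here is a minimal odds-form sketch of that point. The probabilities are made up purely for illustration, not numbers either of us has actually stated:

```python
# Odds-form Bayes: posterior odds = prior odds * likelihood ratio.
# All probabilities below are hypothetical, chosen only to illustrate the point.

def posterior_odds(prior_odds, p_obs_given_your_model, p_obs_given_rival_model):
    return prior_odds * (p_obs_given_your_model / p_obs_given_rival_model)

prior = 1.0  # start at even odds between the two views, for simplicity

# Observation: existing AIs mostly lack property X.
# If your model already expected existing AIs to lack X, both models assign
# high probability to the observation, the likelihood ratio is near 1, and
# the update is small:
print(posterior_odds(prior, 0.85, 0.95))  # ~0.89, a mild hit

# If your model instead expected existing AIs to show X, the same observation
# is a much bigger hit:
print(posterior_odds(prior, 0.30, 0.95))  # ~0.32
```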
You might say that AlphaZero can make plans with consequences it can’t explain, and so that’s a great example of an AI system with property X (so that language models are evidence against your position, but AlphaZero is evidence in favor). That would seem to correspond to the relatively concrete prediction that AlphaZero’s inability to explain itself is fundamentally hard to overcome, and so it wouldn’t be easy to train a system like AlphaZero that is able to explain the consequences of its actions.
Is that the kind of prediction you’d want to stand behind?
(still travelling; still not going to reply in a ton of depth; sorry. also, this is very off-the-cuff and unreflected-upon.)
Which of “being smart,” “being a good person,” and “still being a good person in a Chinese bureaucracy” do you think is hard (prior to having AI smart enough to be dangerous)?
For all that someone says “my image classifier is very good”, I do not expect it to be able to correctly classify “a screenshot of the code for an FAI” as distinct from everything else. There are some cognitive tasks that look so involved as to require smart-enough-to-be-dangerous capabilities. Some such cognitive tasks can be recast as “being smart”, just as they can be cast as “image classification”. Those ones will be hard without scary capabilities. Solutions to easier cognitive problems (whether cast as “image classification” or “being smart” or whatever) by non-scary systems don’t feel to me like they undermine this model.
“Being good” is one of those things where the fact that a non-scary AI checks a bunch of “it was being good” boxes before some consequent AI gets scary, does not give me much confidence that the consequent AI will also be good, much like how your chimps can check a bunch of “is having kids” boxes without ultimately being an IGF maximizer when they grow up.
My cached guess as to our disagreement vis-à-vis “being good in a Chinese bureaucracy” is whether or not some of the difficult cognitive challenges (such as understanding certain math problems well enough to have insights about them) decompose such that those cognitions can be split across a bunch of non-scary reasoners in a way that succeeds at the difficult cognition without the aggregate itself being scary. I continue to doubt that and don’t feel like we’ve seen much evidence either way yet (but perhaps you know things I do not).
(from the OP:) Yet it seems like GPT-3 already has a strong enough understanding of what humans care about that it could be used for this purpose.
To be clear, I agree that GPT-3 already has strong enough understanding to solve the sorts of problems Eliezer was talking about in the “get my grandma out of the burning house” argument. I read (perhaps ahistorically) the grandma-house argument as being about how specifying precisely what you want is real hard. I agree that AIs will be able to learn a pretty good concept of what we want without a ton of trouble. (Probably not so well that we can just select one of their concepts and have it optimize for that, in the fantasy-world where we can leaf through its concepts and have it optimize for one of them, because of how the empirically-learned concepts are more likely to be like “what we think we want” than “what we would want if we were more who we wished to be” etc. etc.)
Separately, in other contexts where I talk about AI systems understanding the consequences of their actions being a bottleneck, it’s understanding of consequences sufficient for things like fully-automated programming and engineering. Which look to me like they require a lot of understanding-of-consequences that GPT-3 does not yet possess. My “for the record” above was trying to make that clear, but wasn’t making the above point where I think we agree clear; sorry about that.
Does that correspond to some prediction about the kind of imitation task that will prove difficult for AI?
It would take a bunch of banging, but there’s probably some sort of “the human engineer can stare at the engineering puzzle and tell you the solution (by using thinking-about-consequences in the manner that seems to me to be tricky)” that I doubt an AI can replicate before being pretty close to being a good engineer. Or similar with, like, looking at a large amount of buggy code (where fixing the bug requires understanding some subtle behavior of the whole system) and then telling you the fix; I doubt an AI can do that before it’s close to being able to do the “core” cognitive work of computer programming.
It seems reasonable for you to say “language models aren’t like the kind of AI systems we are worried about,” but I feel like in that case each unit of progress in language modeling needs to be evidence against your view.
Maybe somewhat? My models are mostly like “I’m not sure how far language models can get, but I don’t think they can get to full-auto programming or engineering”, and when someone is like “well they got a little farther (although not as far as you say they can’t)!”, it does not feel to me like a big hit. My guess is it feels to you like it should be a bigger hit, because you’re modelling the skills that copilot currently exhibits as being more on-a-continuum with the skills I don’t expect language models can pull off, and so any march along the continuum looks to you like it must be making me sweat?
If things like copilot smoothly increase in “programming capability” to the point that they can do fully-automated programming of complex projects like twitter, then I’d be surprised.
I still lose a few Bayes points each day to your models, which more narrowly predict that we’ll take each next small step, whereas my models are more uncertain and say “for all I know, today is the day that language models hit their wall”. I don’t see the ratios as very large, though.
or else some way of grounding out the objection in intuitions that do make some different prediction about something we actually observe (either in the interim or historically).
A man can dream. We may yet be able to find one, though historically when we’ve tried it looks to me like we are mostly reading the same history in different ways, which makes things tricky.
My specific prediction: “chain of thought” style approaches scale to (at least) human level AGI. The most common way in which these systems will be able to self-modify is by deliberately choosing their own finetuning data. They’ll also be able to train new and bigger models with different architectures, but the primary driver of capabilities increases will be increasing the compute used for such models, not new insights from the AGIs.
I would love for you two to bet, not necessarily because of epistemic hygiene, but because I don’t know who to believe here and I think betting would enumerate some actual predictions about AGI development that might clarify for me how exactly you two disagree in practice.