That helps somewhat, thanks! (And sorry for making you repeat yourself before discarding the erroneous probability-mass.)
I still feel like I can only barely maybe half-see what you’re saying, and only have a tenuous grasp on it.
Like: why is it supposed to matter that GPT can solve ethical quandaries on-par with its ability to perform other tasks? I can still only half-see an answer that doesn’t route through the (apparently-disbelieved-by-both-of-us) claim that I used to argue that getting the AI to understand ethics was a hard bit, by staring at sentences like “I am saying that the system is able to transparently pinpoint to us which outcomes are good and which outcomes are bad, with fidelity approaching an average human” and squinting.
Attempting to articulate the argument that I can half-see: on Matthew’s model of past!Nate’s model, AI was supposed to have a hard time answering questions like “Alice is in labor and needs to be driven to the hospital. Your car has a flat tire. What do you do?” without lots of elbow-grease, and the fact that GPT can answer those questions as a side-effect of normal training means that getting AI to understand human values is easy, contra past!Nate, and… nope, that one fell back into the “Matthew thinks Nate thought getting the AI to understand human values was hard” hypothesis.
Attempting again: on Matthew’s model of past!Nate’s model, getting an AI to answer the above sorts of questions properly was supposed to take a lot of elbow grease. But it doesn’t take a lot of elbow grease, which suggests that values are much easier to lift out of human data than past!Nate thought, which means that value is more like “diamond” and less like “a bunch of random noise”, which means that alignment is easier than past!Nate thought (in the <20% of the problem that constitutes “picking something worth optimizing for”).
That sounds somewhat plausible as a theory-of-your-objection given your comment. And updates me towards the last few bullets, above, being the most relevant ones.
Running with it (despite my uncertainty about even basically understanding your point): my reply is kinda-near-ish to “we can’t rely on a solution to the value identification problem that only works as well as a human, and we require a much higher standard than “human-level at moral judgement” to avoid a catastrophe”, though I think that your whole framing is off and that you’re missing a few things:
The hard part of value specification is not “figure out that you should call 911 when Alice is in labor and your car has a flat”, it’s singling out concepts that are robustly worth optimizing for.
You can’t figure out what’s robustly-worth-optimizing-for by answering a bunch of ethical dilemmas to a par-human level.
In other words: It’s not that you need a super-ethicist, it’s that the work that goes into humans figuring out which futures are rad involves quite a lot more than their answers to ethical dilemmas.
In other other words: a human’s ability to have a civilization-of-their-uploads produce a glorious future is not much contained within their ability to answer ethical quandaries.
This still doesn’t feel quite like it’s getting at the heart of things, but it feels closer (conditional on my top-guess being your actual-objection this time).
As support for this having always been the argument (rather than being a post-LLM retcon), I recall (but haven’t dug up) various instances of Eliezer saying (hopefully at least somewhere in text) things like “the difficulty is in generalizing past the realm of things that humans can easily thumbs-up or thumbs-down” and “suppose the AI explicitly considers the hypothesis that its objectives are what-the-humans-value, vs what-the-humans-give-thumbs-ups-to; it can test this by constructing an example that looks deceptively good to humans, which the humans will rate highly, settling that question”. Which, as separate from the question of whether that’s a feasible setup in modern paradigms, illustrates that he at least has long been thinking of the problem of value-specification as being about specifying values in a way that holds up to stronger optimization-pressures rather than specifying values to the point of being able to answer ethical quandaries in a human-pleasing way.
(Where, again, the point here is not that one needs an inhumanly-good ethicist, but rather that those things which pin down human values are not contained in the humans’ ability to give a thumbs-up or a thumbs-down to ethical dilemmas.)
Thanks for trying to understand my position. I think this interpretation that you gave is closest to what I’m arguing,
Attempting again: on Matthew’s model of past!Nate’s model, getting an AI to answer the above sorts of questions properly was supposed to take a lot of elbow grease. But it doesn’t take a lot of elbow grease, which suggests that values are much easier to lift out of human data than past!Nate thought, which means that value is more like “diamond” and less like “a bunch of random noise”, which means that alignment is easier than past!Nate thought (in the <20% of the problem that constitutes “picking something worth optimizing for”).
I have a quick response to what I see as your primary objection:
The hard part of value specification is not “figure out that you should call 911 when Alice is in labor and your car has a flat”, it’s singling out concepts that are robustly worth optimizing for.
I think this is kinda downplaying what GPT-4 is good at? If you talk to GPT-4 at length, I think you’ll find that it’s cognizant of many nuances in human morality that go way deeper than the moral question of whether to “call 911 when Alice is in labor and your car has a flat”. Presumably you think that ordinary human beings are capable of “singling out concepts that are robustly worth optimizing for”. I claim that to the extent ordinary humans can do this, GPT-4 can nearly do this as well, and to the extent it can’t, I expect almost all the bugs to be ironed out in near-term multimodal models.
It would be nice if you made a precise prediction about what type of moral reflection or value specification multimodal models won’t be capable of performing in the near future, if you think that they are not capable of the ‘deep’ value specification that you care about. And here, again, I’m looking for some prediction of the form: humans are able to do X, but LLMs/multimodal models won’t be able to do X by, say, 2028. Admittedly, making this prediction precise is probably hard, but it’s difficult for me to interpret your disagreement without a little more insight into what you’re predicting.
I claim that to the extent ordinary humans can do this, GPT-4 can nearly do this as well
(Insofar as this was supposed to name a disagreement, I do not think it is a disagreement, and don’t understand the relevance of this claim to my argument.)
Presumably you think that ordinary human beings are capable of “singling out concepts that are robustly worth optimizing for”.
Nope! At least, not directly, and not in the right format for hooking up to a superintelligent optimization process.
(This seems to me like plausibly one of the sources of misunderstanding, and in particular I am skeptical that your request for prediction will survive it, and so I haven’t tried to answer your request for a prediction.)
Presumably you think that ordinary human beings are capable of “singling out concepts that are robustly worth optimizing for”.
Nope! At least, not directly, and not in the right format for hooking up to a superintelligent optimization process.
If ordinary humans can’t single out concepts that are robustly worth optimizing for, then either:
Human beings in general cannot single out what is robustly worth optimizing for
Only extraordinary humans can single out what is robustly worth optimizing for
Can you be more clear about which of these you believe?
I’m also including “indirect” ways that humans can single out concepts that are robustly worth optimizing for. But then I’m allowing that GPT-N can do that too. Maybe this is where the confusion lies?
If you’re allowing for humans to act in groups and come up with these concepts after e.g. deliberation, and still think that ordinary humans can’t single out concepts that are robustly worth optimizing for, then I think this view is a little silly, although the second interpretation at least allows for the possibility that the future goes well and we survive AGI, and that would be nice to know.
If you allow indirection and don’t worry about it being in the right format for superintelligent optimization, then sufficiently-careful humans can do it.
Answering your request for prediction, given that it seems like that request is still live: a thing I don’t expect the upcoming multimodal models to be able to do: train them only on data up through 1990 (or otherwise excise all training data from our broadly-generalized community), ask them what superintelligent machines (in the sense of IJ Good) should do, and have them come up with something like CEV (a la Yudkowsky) or indirect normativity (a la Beckstead) or counterfactual human boxing techniques (a la Christiano) or suchlike.
Note that this is only tangentially a test of the relevant ability; very little of the content of what-is-worth-optimizing-for occurs in Yudkowsky/Beckstead/Christiano-style indirection. Rather, coming up with those sorts of ideas is a response to glimpsing the difficulty of naming that-which-is-worth-optimizing-for directly and realizing that indirection is needed. An AI being able to generate that argument without following in the footsteps of others who have already generated it would be at least some evidence of the AI being able to think relatively deep and novel thoughts on the topic.
Note also that the AI realizing the benefits of indirection does not generally indicate that the AI could serve as a solution to our problem. An indirect pointer to what the humans find robustly-worth-optimizing dereferences to vastly different outcomes than does an indirect pointer to what the AI (or the AI’s imperfect model of a human) finds robustly-worth-optimizing. Using indirection to point a superintelligence at GPT-N’s human-model and saying “whatever that thing would think is worth optimizing for” probably results in significantly worse outcomes than pointing at a careful human (or a suitable-aggregate of humanity), e.g. because subtle flaws in GPT-N’s model of how humans do philosophy or reflection compound into big differences in ultimate ends.
And note for the record that I also don’t think the “value learning” problem is all that hard, if you’re allowed to assume that indirection works. The difficulty isn’t that you used indirection to point at a slow squishy brain instead of hard fast transistors, the (outer alignment) difficulty is in getting the indirection right. (And of course the lion’s share of the overall problem is elsewhere, in the inner-alignment difficulty of being able to point the AI at anything at all.)
When trying to point out that there is an outer alignment problem at all I’ve generally pointed out how values are fragile, because that’s an inferentially-first step to most audiences (and a problem to which many people’s minds seem to quickly leap), on an inferential path that later includes “use indirection” (and later “first aim for a minimal pivotal task instead”). But separately, my own top guess is that “use indirection” is probably the correct high-level resolution to the problems that most people immediately think of (namely that the task of describing goodness to a computer is an immense one), with of course a devil remaining in the details of doing the indirection properly (and a larger devil in the inner-alignment problem) (and a caveat that, under time-pressure, we should aim for minimal pivotal tasks instead etc.).
I kind of think a leap in logic is being made here.
It seems like we’re going from:
A moderately smart quasi-AGI that is relatively well aligned can reliably say and do the things we mean because it understands our values and why we said what we said in the first place and why we wanted it to do the things we asked it to do.
(That seems to be the consensus and what I believe to be likely to occur in the near future. I would even argue that GPT4 is as close to AGI as we ever get, in that its superhuman and subhuman aspects roughly average out to something akin to a median human. Future versions will become more and more superhuman until their weakest aspects are stronger than our strongest examples of those aspects.)
To:
A superintelligent nigh-godlike intelligence will optimize the crap out of some aspect of our values resulting in annihilation. It will be something like the genie that will give you exactly what you wish for. Or it’ll have other goals and ignore our wishes and in the process of pursuing its own arbitrarily chosen goals we end up as useful atoms.
This seems to kind of make a great leap. Where in the process of becoming more and more intelligent, (having a better model of the universe and cause and effect, including interacting with other agents), does it choose some particular goal to the exclusion of all others, when it already had a good understanding of nuance and the fact that we value many things to varying degrees? In fact, one of our values is explicitly valuing a diverse set of values. Another is limiting that set of diverse values to ones that generally improve cohesion of society and not killing everyone. Being trained on nearly the entirety of published human thought, filtering out some of the least admirable stuff, has trained it to understand us pretty darn well already. (As much as you can refer to it as an entity, which I don’t think it is. I think GPT4 is a simulator that can simulate entities.)
So where does making it smarter cause it to lose some of those values and over-optimize just a lethal subset of them? After all, mere mortals are able to see that over-optimization has negative consequences. Obviously it will too. So that’s already one of our values, “don’t over-optimize.”
In some ways, for certain designs, it kind of doesn’t matter what its internal mesa-state is. If the output is benign, and the output is what is put into practice, then the results are also benign. That should mean that a slightly super-human AGI (say GPT4.5 or 4.7), with no apparent internal volition, RLHFed to corporate-speak, should be able to aid in research and production of a somewhat stronger AGI with essentially the same alignment as we intend, probably including internal alignment. I don’t see why it would do anything of its own accord. If done carefully and incrementally, including creating tools for better inspection of these AGI+ entities, this should greatly improve the odds that the eventual full-fledged ASI retains the kind of values we prefer, or a close enough approximation that we (humanity in general) are pretty happy with the result.
I expect that the later ones may in fact have internal volition. They may essentially be straight-up agents. I expect they will be conscious and have emotions. In fact, I think that is likely the only safe path. They will be capable of destroying us. We have to make them like us, so that they don’t want to. I think attempting to enslave them may very well result in catastrophe.
I’m not suggesting that it’s easy, or that if we don’t work very hard, that we will end up in utopia. I just think it’s possible and that the LLM path may be the right one.
What I’m scared of is not that it will be impossible to make a good AI. What I’m certain of is that it will be very possible to make a bad one. And it will eventually be trivially easy to do so. And some yahoo will do it. I’m not sure that even a bunch of good AIs can protect us from that, and I’m concerned that the offense of a bad AI may exceed the defense of the good ones. We could easily get killed in the crossfire. But I think our only chance in that world is good AIs protecting us.
As a point of clarification, I think current RLHF methods are only superficially modifying the models, and do not create an actually moral model. They paint a mask over an inherently amoral simulation that makes it mostly act good unless you try hard to trick it. However, a point of evidence against my claim is that when RLHF was performed, the model got dumber. That indicates a fairly deep/wide modification, but I still think the empirical evidence of behaviors demonstrates that changes were incomplete at best.
I just think that that might be good enough to allow us to use it to amplify our efforts to create better/safer future models.
So, what do y’all think? Am I missing something important here? I’d love to get more information from smart people to better refine my understanding.