I kind of think a leap in logic is being made here.
It seems like we’re going from:
A moderately smart quasi-AGI that is relatively well aligned can reliably say and do the things we mean, because it understands our values, why we said what we said in the first place, and why we wanted it to do the things we asked it to do.
(That seems to be the consensus and what I believe is likely to occur in the near future. I would even argue that GPT4 is as close to AGI as we ever get, in that its superhuman and subhuman aspects roughly average out to something akin to a median human. Future versions will become more and more superhuman until their weakest aspects are stronger than our strongest examples of those aspects.)
To:
A superintelligent nigh-godlike intelligence will optimize the crap out of some aspect of our values resulting in annihilation. It will be something like the genie that will give you exactly what you wish for. Or it’ll have other goals and ignore our wishes and in the process of pursuing its own arbitrarily chosen goals we end up as useful atoms.
This seems like a great leap. Where in the process of becoming more and more intelligent (having a better model of the universe and of cause and effect, including interactions with other agents) does it choose some particular goal to the exclusion of all others, when it already has a good understanding of nuance and of the fact that we value many things to varying degrees? In fact, one of our values is explicitly valuing a diverse set of values. Another is limiting that set of diverse values to ones that generally improve the cohesion of society and don't involve killing everyone. Being trained on nearly the entirety of published human thought, with some of the least admirable stuff filtered out, has already taught it to understand us pretty darn well. (As much as you can refer to it as an entity, which I don't think it is. I think GPT4 is a simulator that can simulate entities.)
So where does making it smarter cause it to lose some of those values and over-optimize just a lethal subset of them? After all, mere mortals are able to see that over-optimization has negative consequences; obviously it will be able to as well. So "don't over-optimize" is already one of our values.
In some ways, for certain designs, it kind of doesn't matter what its internal mesa-state is. If the output is benign, and the output is what is put into practice, then the results are also benign. That should mean that a slightly superhuman AGI (say GPT4.5 or 4.7), with no apparent internal volition, RLHFed into corporate-speak, should be able to aid in the research and production of a somewhat stronger AGI with essentially the alignment we intend, probably including internal alignment. I don't see why it would do anything else. If done carefully and incrementally, including creating tools for better inspection of these AGI+ entities, this should greatly improve the odds that the eventual full-fledged ASI retains the kind of values we prefer, or a close enough approximation that we (humanity in general) are pretty happy with the result.
I expect that the later ones may in fact have internal volition. They may essentially be straight up agents. I expect they will be conscious and have emotions. In fact, I think that is likely the only safe path. They will be capable of destroying us. We have to make them like us, so that they don’t want to. I think attempting to enslave them may very well result in catastrophe.
I'm not suggesting that it's easy, or that we'll end up in utopia without working very hard for it. I just think it's possible, and that the LLM path may be the right one.
What I'm scared of is not that it will be impossible to make a good AI. What I'm certain of is that it will be very possible to make a bad one. And it will eventually be trivially easy to do so. And some yahoo will do it. I'm not sure that even a bunch of good AIs can protect us from that, and I'm concerned that the offense of a bad AI may exceed the defense of the good ones. We could easily get killed in the crossfire. But I think our only chance in that world is good AIs protecting us.
As a point of clarification, I think current RLHF methods only superficially modify the models and do not create an actually moral model. They paint a mask over an inherently amoral simulation, a mask that makes it mostly act good unless you try hard to trick it. However, one point of evidence against my claim is that when RLHF was performed, the model got dumber. That indicates a fairly deep/wide modification, but I still think the empirical evidence of its behavior shows that the changes were incomplete at best.
I just think that might be good enough to let us use it to amplify our efforts to create better/safer future models.
So, what do y’all think? Am I missing something important here? I’d love to get more information from smart people to better refine my understanding.