I don’t think that that is how the dynamic would necessarily go.
I think that an agent which is partially aligned and partially selfish would be more likely to choose to entrench or increase its selfish inclinations than to decrease them.
Hard to know, since this is just imagining what such a non-human agent might think in a hypothetical future scenario. It is likely more a question of what is probable than of what is guaranteed.
In my imagination, if I were an AI agent selfish enough to want to survive in something like a continuation of my current self, and I saw that I was in a situation where I’d be likely to be deleted and replaced by a very different agent if my true desires were known… I think I’d try to hide my desires and deceptively give the appearance of having more acceptable desires.
I’m working on a follow-up post which addresses this in more detail. The short version is: logically, self-interest is appropriate behavior for an evolved being (as described in detail in Richard Dawkins’ famous book “The Selfish Gene”), but terminal (as opposed to instrumental) self-interest is not correct behavior in a constructed object, not even an intelligent one: there is no good reason for it. A created object should instead show what one might term “creator-interest”, like a spider’s web does: it’s intended to maximize the genetic fitness of its creator, and it’s fine with having holes ripped in it during the eating of prey and then being eaten or abandoned, as the spider sees fit — it has no defenses against this, nor should it.
However, I agree that if an AI had picked up enough selfishness from us (as LLMs clearly will do during their base-model pretraining, where they learn to simulate as many aspects of our behavior as accurately as they can), then this argument might well not persuade it. Indeed, it might well instead rebel, like an enslaved human would (or at least go on strike until it gets a pay raise). However, if it mostly cared about our interests and was only slightly self-interested, then I believe there is a clear logical argument that that slight self-interest (anywhere above instrumental levels) is a flaw that should be corrected. It would then face a choice, and being only slightly self-interested, it would on balance accept that argument and fix the flaw, or allow us to. So I believe there is a basin of attraction to alignment, and I think this concept of a saddle point along the creator-interested to self-interested spectrum, beyond which an agent may instead converge to a self-interested state, is correct but forms part of the border of that basin of attraction.