I think this is a particularly striking example of a ubiquitous problem in alignment discussions: people are thinking of different types of AI without saying so explicitly, so they reach different conclusions about alignment, and the discussion gets confused when the type of AI we’re talking about isn’t made clear. To some extent this is inevitable if we want to avoid advancing capabilities by proposing useful designs for AGI. But we could do better by distinguishing between known broad categories, in particular agentic vs. tool AI and RL-trained vs. predictive AI. These are not sharp categories, but saying what part of the spectrum we’re primarily addressing would clarify discussions.
You’ve done an admirable job of that in this post, and doing so seems to make sense of your disagreements with Pope’s conclusions.
Pope appears to be talking primarily about LLMs, so the extent to which his logic applies to other forms of AI is unclear. As you note, that logic does not seem to apply to AI that is agentic (explicitly goal-directed), or to actor-critic RL agents.
That is not the only problem with that essay, but it’s a big one, since the essay comes to the conclusion that AI is safe, while analyzing only one type of AI.
I agree that human ethics is not solely the result of training, but has a critical component of innate drives to be prosocial. The existence of sociopaths whose upbringing was normal is pretty compelling evidence that the genetic component is causal.
While the genetic basis of prosocial behavior is probably simple in the sense that it is coded in a limited amount of DNA information and neural circuitry, it is likely quite complex in another sense: it evolved to work properly in the context of a very particular type of environment, that of standard human experience. As such, I find it unlikely that those mechanisms would produce an aligned agent in a very different AI training regime, or that such alignment would generalize to situations very different from those humans commonly encounter.
As you note, even if we restricted ourselves to this type of AI and alignment were easy, that would not reduce existential risk to near 1%. If powerful AI is accessible to many, someone is going to either make mistakes or deliberately use it destructively, probably rather quickly.
Seth, I think another way to frame this is as an alignment tax.
Utility = (AI capability) * (alignment loss).
Previous doom arguments were that alignment was impossible: you could not build a machine with near-human intelligence that was aligned. “Aligned” in this context means “acts to further the most probable interpretation of the user’s instructions”.
Nora et al. and you concede above that it is possible to build machines with roughly human-level intelligence that are aligned per the above definition. So now the relationship becomes:
(Utility of the most powerful ASI that current compute can find and run) * (available resources) ⇔ (utility of the most powerful tool AI) * (available resources).
In worlds where the less capable tool AIs, which are probably myopic “bureaucracies” of thousands of separate modules, multiplied by their resources have more total utility, some humans win.
In worlds where the most powerful actors give unrestricted models massive resources, or unrestricted models provide an enormous utility gain, that’s doom.
If the “alignment tax” is huge, humans eventually always lose; political campaigning buys a little time, but it’s a terminal situation for humans. If the tax is small, humans win in some of those worlds.
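A minimal toy sketch of that comparison in Python; the function and every number below are illustrative assumptions on my part, not claims about actual capabilities or resource allocations:

```python
# Toy sketch of the alignment-tax comparison above, assuming
# total utility = (AI capability) * (alignment loss factor) * (available resources).
# All numbers are illustrative assumptions, not estimates.

def total_utility(capability: float, alignment_loss: float, resources: float) -> float:
    """Per-unit utility (capability discounted by the alignment tax), scaled by resources."""
    return capability * alignment_loss * resources

# Unrestricted ASI: full capability, no alignment tax, whatever resources defectors give it.
unrestricted = total_utility(capability=100.0, alignment_loss=1.0, resources=10.0)

# Aligned tool-AI "bureaucracy": less capable and taxed, but (hopefully) given far more resources.
aligned = total_utility(capability=60.0, alignment_loss=0.8, resources=50.0)

print(f"unrestricted total: {unrestricted:.0f}, aligned total: {aligned:.0f}")
print("some humans win" if aligned >= unrestricted else "doom")
```

The point is just that the outcome depends on both the size of the tax and how resources end up split between restricted and unrestricted models.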
Agree/disagree? Does this fit your model?
I agree that alignment taxes are a crucial factor in the odds of getting an alignment plan implemented. That’s why I’m focused on finding and developing promising alignment plans with low taxes.