I do think that most current alignment work does also advance capabilities, but that the distinction should mostly be ‘clear’ even if there are important shades of gray and you cannot precisely define a separator.
For a large subclass of work, I actually disagree with this claim: the areas where alignment work wouldn’t accelerate capabilities are those focused on reducing deceptively-aligned-AI takeover risk and sharp left turn risk, for the reasons that @RogerDearnaley stated here:
https://www.lesswrong.com/posts/JviYwAk5AfBR7HhEn/how-to-control-an-llm-s-behavior-why-my-p-doom-went-down-1#jaqADvsmqmqMKRimH
So that’s the thing, right? Fictional worlds like this almost never actually make sense on closer examination. The incentives and options and actions are based on the plot and the need to tell human stories rather than following good in-universe logic. The worlds in question are almost always highly fragile, they really should blow up, and the AIs ensure the humans work out okay in some sense ‘because of reasons’, because it feels right to a human writer and their sense of morality or something, rather than because this would actually happen.
I worry this kind of perspective is load bearing, given he thinks it is ‘correctly predicting the future’: the idea that ‘prosaic alignment’ will push strongly enough toward some common-sense-morality style of not harming the humans, despite all the competitive dynamics among AIs and the various other things they value and grow to value, that things turn out fine by default, in worlds that to me seem past their point of no return and infinitely doomed unless you think the AIs themselves have value.
I think one key crux here is whether you think partial alignment successes are possible. If AI alignment ends up binary, I would agree that Her is basically an incoherent description of the future.
If AI alignment ends up more of a continuous quantity such that reasonably large partial successes are probable, then Her is more coherent as a plausible future.
I admit I tend to favor the continuous side of the debate more than the discrete side, and tend to see discreteness as an abstraction over the actual continuous outcomes.
To be clear, Her is not a totally coherent story, but I do think that relatively minor changes are enough to make it more coherent.
On this:
Without getting into any technical arguments, it seems rather absurd to suggest the set of goals that imply undesired subgoals within plausibly desired goal space would have measure zero? I don’t see how this survives contact with common sense or relation to human experience or typical human situations.
The answer is that probabilities get weird with infinite spaces: an event can have probability 0% even though it is possible, and an event can have probability 100% even though there are scenarios in which it fails to happen.
A standard example: if you pick a real number uniformly at random, the probability of getting any particular number is 0%, no matter which one, and the probability of getting an irrational number is 100%, yet that doesn’t mean a rational number can’t be sampled.
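To make that concrete, here is a minimal worked example; it is standard measure theory rather than anything from the original post, and the uniform distribution on [0,1] is just an illustrative assumption:

```latex
% Worked example: X drawn uniformly at random from [0,1].
\[
  P(X = x) = 0 \quad \text{for every individual } x \in [0,1],
\]
% and since the rationals are countable, countable additivity gives
\[
  P\bigl(X \in \mathbb{Q}\bigr) = \sum_{q \in \mathbb{Q} \cap [0,1]} P(X = q) = 0,
  \qquad\text{so}\qquad
  P\bigl(X \notin \mathbb{Q}\bigr) = 1,
\]
% even though drawing a rational number is not logically impossible.
```

So ‘probability zero’ and ‘impossible’ come apart in exactly this way once the space of outcomes is continuous.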
I don’t think that set actually has measure zero, but that’s for different reasons.
My response to Joshua Achiam’s point on why we would get instrumental convergence is that o1 exists.
While it often doesn’t choose to act on instrumentally convergent incentives, it does have the capability for basic instrumental convergence, and in AI scaling, noisy capabilities foreshadow robust and powerful ones.
Indeed, I would go further. The market wants the AIs to be given as much freedom and authority as possible, to send them out to compete for resources and influence generally, for various ultimate purposes. And the outcome of those clashes and various selection effects and resource competitions, by default, dooms us.
I think it depends on whether we are assessing it using X-risk standards or GCR (global catastrophic risk) standards.
On the one hand, I absolutely believe that we could well get into GCR territory, for something like the reason described here:
https://www.lesswrong.com/posts/2ujT9renJwdrcBqcE/the-benevolence-of-the-butcher
But I don’t think it will get into X-risk territory, both because I expect AI agents to be more controlled than the Molochian story tells us, and because I expect there to be some winners who are able to go to the stars and imprint their values on them.
Sixteen years ago, Eliezer Yudkowsky wrote the Value Theory sequence, going deep on questions like what makes things have value to us, how to reconcile when different entities (human or otherwise) have very different values, and so on. If you’re interested in these questions, this is a great place to start. I have often tried to emphasize that I continue to believe that Value is Fragile, whereas many who don’t believe in existential risk think value is not fragile.
It is a highly understood problem among our crowd that ‘human values’ is both very complex and a terrifyingly hard thing to pin down, and that people very strongly disagree about what they value.
Also it is a terrifyingly easy thing to screw up accidentally, and we have often said that this is one of the important ways to build AGI and lose – that you choose a close and well-meaning but incorrect specification of values, or your chosen words get interpreted that way, or someone tries to get the AGI to find those values by SGD or other search and it gets a well-meaning but incorrect specification.
I actually disagree with both Joshua Achiam and a lot of LWers on the assumption that value is complicated (at least in its generative structure). I also don’t think this failure mode, conditional on no deceptive or proxy misalignment, is likely to happen at all, because capabilities people are incentivized to make our specifications better, and because I think the function generating human values is actually fairly simple once we abstract away the complicated mechanisms.
I agree with the literal claim of the Value is Fragile post, in the sense that it is probably true with reasonably high probability, but not so much with the other conclusions often held in that cluster.
Also, I think one very large meta-crux that underlies a lot of other cruxes is whether the best strategy is to do locally optimal things and iterate, closest to model-free RL approaches, or to build an explicit model and optimize within that model, closest to model-based RL/search approaches. Cf these comments:
https://www.lesswrong.com/posts/gmHiwafywFo33euGz/aligned-foundation-models-don-t-imply-aligned-systems#xoe3e8uJp4xnx4okn
https://www.lesswrong.com/posts/8A6wXarDpr6ckMmTn/another-argument-against-utility-centric-alignment-paradigms#8KMQEYCbyQoLccbPa
I think this is a meta-crux underlying all other disagreements over AI alignment strategy, and unfortunately I think we will probably have to resolve it the hard way, by executing both strategies in parallel in reality and seeing which one wins out as AI progresses.
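As a rough illustration of the two strategies (a toy sketch added for concreteness, not code from the linked comments; the environment and function names are made up), a model-free learner improves its value estimates directly from sampled experience and iterates, while a model-based planner builds an explicit model of the environment and optimizes within it:

```python
import random

# Toy 1-D chain environment: states 0..4, actions -1/+1, reward 1 only for reaching state 4.
N_STATES, GOAL = 5, 4

def step(state, action):
    next_state = max(0, min(GOAL, state + action))
    reward = 1.0 if next_state == GOAL else 0.0
    return next_state, reward

# "Do locally optimal things and iterate": model-free Q-learning that
# improves action-value estimates directly from sampled transitions.
def model_free(episodes=500, alpha=0.1, gamma=0.9, eps=0.2):
    Q = {(s, a): 0.0 for s in range(N_STATES) for a in (-1, 1)}
    for _ in range(episodes):
        s = 0
        for _ in range(20):
            if random.random() < eps:
                a = random.choice((-1, 1))
            else:
                a = max((-1, 1), key=lambda act: Q[(s, act)])
            s2, r = step(s, a)
            Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, -1)], Q[(s2, 1)]) - Q[(s, a)])
            s = s2
    return Q

# "Build an explicit model and optimize in the model": value iteration
# over a known transition and reward model of the same environment.
def model_based(gamma=0.9, iters=50):
    V = [0.0] * N_STATES
    for _ in range(iters):
        V = [max(r + gamma * V[s2] for s2, r in (step(s, -1), step(s, 1)))
             for s in range(N_STATES)]
    return V

if __name__ == "__main__":
    print(model_free())   # action values learned from sampled experience
    print(model_based())  # state values computed by planning in the explicit model
```

The analogy to alignment strategy is only loose, but it shows the basic trade-off: the first approach needs no explicit model and adapts as it goes, while the second leans entirely on the fidelity of the model it plans in.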
Edit:
I have to say something about this:
That is how every law and every treaty or agreement works, and indeed the only way they can work.
No, this isn’t how every treaty works.
Treaty violations among states are usually not enforced with the threat of war, for fairly obvious reasons; instead, they are settled in some other way.