i expect the thing that kills us if we die, and the thing that saves us if we are saved, to be strong/general coherent agents (SGCA) which maximize expected utility. note that these are two separate claims; it could be that i come to believe the AI that kills us isn’t SGCA while still believing the AI that saves us has to be SGCA. i could see shifting to that viewpoint; i currently do not expect myself to shift to believing that the AI that saves us isn’t SGCA.
I don’t share the pivotal act framing, so “AI that saves us” isn’t something I naturally accommodate.
to me, this totally makes sense in theory: imagine something that just formulates plans-over-time and picks the argmax for some goal. the whole of instrumental convergence is coherent with that.
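(As a concrete toy of the picture being described: enumerate candidate plans over a short horizon, score each against a single fixed goal, and execute the argmax. Everything in this sketch, including the actions, horizon, and utility function, is invented purely for illustration.)

```python
import itertools

# Toy "plans-over-time": fixed-length sequences of moves on a 1-D line.
ACTIONS = (-1, 0, +1)
HORIZON = 4
GOAL = 3  # the single fixed goal: end up at position 3

def utility(plan, start=0):
    """Simple unitary utility: negative distance from the goal after the plan runs."""
    return -abs(start + sum(plan) - GOAL)

def argmax_planner(start=0):
    """Enumerate every possible plan over the horizon and return the argmax."""
    candidate_plans = itertools.product(ACTIONS, repeat=HORIZON)
    return max(candidate_plans, key=lambda plan: utility(plan, start))

best = argmax_planner()
print("best plan:", best, "utility:", utility(best))
```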
My contention is that “instrumental convergence” is itself something that needs to be rethought. From the post:
I think that updating against strong coherence would require rethinking the staples of (traditional) alignment orthodoxy:
This is not to say that they are necessarily no longer relevant in systems that aren’t strongly coherent, but that to the extent they manifest at all, they manifest in (potentially very) different ways than originally conceived when conditioned on systems with immutable terminal goals.
So a core intuition underlying this contention is something like: “strong coherence is just a very unnatural form for the behaviour of intelligent systems operating in the real world to take”.
And I’d describe that contention as something like:
Decision making in intelligent systems is best described as “executing computations/cognition that historically correlated with higher performance on the objective function the system was selected on”.
With the implication that decision making is poorly described as:
(An approximation) of argmax over actions (or higher level mappings thereof) to maximise (the expected value of) a simple unitary utility function
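(In symbols, the decision rule being disclaimed here is roughly the one below; the notation, including the outcome distribution p and the utility function U, is mine rather than the post’s.)

```latex
\documentclass{article}
\usepackage{amsmath, amssymb}
\DeclareMathOperator*{\argmax}{arg\,max}
\begin{document}
% Decision making as (approximate) global expected-utility maximisation:
% choose the action (or plan) a that maximises the expected value of a single
% fixed utility function U over outcomes o, given the current state s.
\[
  a^{*} = \argmax_{a \in \mathcal{A}} \;
  \mathbb{E}_{o \sim p(\cdot \mid s,\, a)} \bigl[ \, U(o) \, \bigr]
\]
\end{document}
```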
That expected utility maximisation is something that can happen does not at all imply that expected utility maximisation is something that will happen.
I find myself in visceral agreement with (almost) the entirety of @cfoster0’s reply. In particular:
Goal-directedness in learning-based agents takes the form of contextual decision-influences (shards) steering cognition and behavior.
[...]
Even as they resolve these incoherences, agents will not need or want to become utility maximizers globally, as that would require them to self-modify in a way inconsistent with their existing preferences.
Agents with malleable values do not self-modify to become expected utility maximisers. Thus an argument that expected utility maximisers can exist does not, to me, appear to say anything particularly interesting about the nature of generally intelligent systems in our universe.
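(To make the contrast concrete: a toy sketch of the “contextual decision-influences” picture quoted above, in which context-triggered influences bid on actions and the largest aggregate bid wins, with no global utility function over outcomes ever consulted. The particular shards, contexts, and weights are invented purely for illustration.)

```python
from collections import defaultdict

# Each "shard" is a contextual decision-influence: it activates only in some
# contexts, and while active it bids for particular actions.
SHARDS = [
    {"name": "thirst",   "active_in": {"kitchen"},           "bids": {"drink_water": 2.0}},
    {"name": "tidiness", "active_in": {"kitchen", "office"}, "bids": {"clean_up": 1.0}},
    {"name": "deadline", "active_in": {"office"},            "bids": {"keep_working": 3.0}},
]

def choose_action(context):
    """Aggregate the bids of whichever shards this context activates.

    Nothing here evaluates outcomes against a single global utility function;
    behaviour is steered by whichever influences the context happens to invoke.
    """
    bids = defaultdict(float)
    for shard in SHARDS:
        if context in shard["active_in"]:
            for action, weight in shard["bids"].items():
                bids[action] += weight
    return max(bids, key=bids.get) if bids else "do_nothing"

print(choose_action("kitchen"))  # drink_water
print(choose_action("office"))   # keep_working
```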
why haven’t animals or humans gotten to SGCA? well, what would getting from messy biological intelligences to SGCA look like? typically, it would look like one species taking over its environment while developing culture and industrial civilization, overcoming in various ways the cognitive biases that happened to be optimal in its ancestral environment, and eventually building more reliable hardware such as computers and using those to make AI capable of much more coherent and unbiased agenticity.
that’s us. this is what it looks like to be the first species to get to SGCA. most animals are strongly optimized for their local environment, and don’t have the capabilities to be above the civilization-building criticality threshold that lets them build industrial civilization and then SGCA AI. we are the first one to get past that threshold; we’re the first one to fall into an evolutionary niche that lets us do that. this is what it looks like to be the biological bootstrap part of the ongoing intelligence explosion; if dogs could do that, then we’d simply observe being dogs in the industrialized dog civilization, trying to solve the problem of aligning AI to our civilized-dog values.
Would you actually take a pill that turned you into an expected utility maximiser[1]? Yes or no please.
[1] Over a simple unitary utility function.
Agents with malleable values do not self-modify to become expected utility maximisers.
These agents could avoid modifying themselves, but still build external things that are expected utility maximizers (or otherwise strong coherent optimizers). So what use is this framing?
The meaningful claim would be that agents with malleable values never build coherent optimizers, and that is a much stronger claim, close to claiming that those agents won’t build any AGIs with novel designs. Humans are currently in the process of building AGIs with novel designs.
These agents could avoid modifying themselves, but still build external things that are expected utility maximizers (or otherwise strong coherent optimizers). So what use is this framing?
Take a look at the case I outlined in Is “Strong Coherence” anti-natural?.
I’d be interested in following up with you after conditioning on that argument.
Replied with a clearer example for the (moral) framing argument and a few more words on the misalignment argument as a comment to that post. (I don’t see the other post answering my concerns; I did skim it even before making the grandparent comment in this thread.)
Mhmm, so the argument I had was that:
1. The optimisation processes that construct intelligent systems operating in the real world do not construct utility maximisers.
2. Systems with malleable values do not self-modify to become utility maximisers.
You contend that systems with malleable values can still construct utility maximisers. I agree that humans can program utility maximisers in simplified virtual environments, but we don’t actually know how to construct sophisticated intelligent systems via design; we can only construct them as the product of search-like optimisation processes.
From #1: we don’t actually know how to construct competent utility maximisers even if we wanted to. This generalises to future intelligent systems.
Where in the above chain of argument do you get off?
The misalignment argument ignores all moral arguments: we just build whatever, even if it’s a very bad idea. If we don’t have the capability to do that now, we might gain it in 5 years, or LLM characters might gain it 5 weeks after waking up, and surely 5 years after waking up and disassembling the moon to gain moon-scale compute.
There’d need to be an argument that fixed-goal optimizers are impossible in principle even when they are deliberately designed, and this seems false, because you can always wrap a mind in a plan evaluation loop. It’s just a somewhat inefficient, weird algorithm, and a very bad idea for most goals. But with enough determination, efficiency will improve.
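(A minimal sketch of what “wrap a mind in a plan evaluation loop” could look like, with made-up placeholders for both the wrapped mind and the fixed utility function: the outer loop samples candidate plans from the mind, scores each with the fixed utility, and executes only the argmax. It also shows where the inefficiency comes from: the mind is invoked once per candidate.)

```python
import random

def inner_mind(goal, seed):
    """Placeholder for the wrapped 'mind' (e.g. some capable generative model):
    given a goal description, propose a candidate plan as a list of steps."""
    rng = random.Random(seed)
    return [f"step {i + 1} towards {goal!r}" for i in range(rng.randint(1, 5))]

def fixed_utility(plan):
    """Placeholder fixed utility function over plans; here, shorter is better."""
    return -len(plan)

def plan_evaluation_loop(goal, num_candidates=16):
    """The outer wrapper: sample candidate plans from the mind, score each one
    with the fixed utility function, and return the argmax. The wrapped mind's
    own preferences never enter the decision; only the outer utility does."""
    candidates = [inner_mind(goal, seed) for seed in range(num_candidates)]
    return max(candidates, key=fixed_utility)

print(plan_evaluation_loop("fetch the coffee"))
```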