There seems to be some confusion about the practical implications of consequentialism in advanced AI systems. It’s possible that a superintelligent AI won’t be a full-blown strict utilitarian consequentialist with quantitatively ordered preferences 100% of the time. But in the context of AI alignment, even at a human level of coherence, a superintelligent unaligned consequentialist results in an “everybody dies” scenario. I think it’s really hard to create a general system that is less consequentialist than a human.
a superintelligent unaligned consequentialist results in “everybody dies” scenario
This depends on what kind of “unaligned” is more likely. LLM-descendant AGIs could plausibly turn out to be a kind of people similar to humans, and if they don’t mishandle their own AI alignment problem when building even more advanced AGIs, whether humanity is allowed to survive comes down to their values. Survival seems very plausible even if they are unaligned in the sense of deciding to take away most of the cosmic endowment for themselves.
I broadly agree that LLM-derived simulacra have a better chance of being human-like, but I don’t think they will be human-like enough to guarantee our survival.
Not a guarantee, but the argument I see is that it’s trivially cheap and safe to let humanity survive, so to the extent there is even a little motivation to do so, that’s a likely outcome. This is opposed by the possibility that LLMs are fine-tuned into utter alienness by the time they are AGIs, or that on reflection they are secretly very alien already (which I don’t buy, since behavior screens off implementation details, and in simulacra capability lives in the visible behavior), or that they botch the next generation of AGIs they build even worse than we are currently botching the process of building them.
Behavior screens off implementation details on distribution. We’ve trained LLMs to sound human, but sometimes they wander off-distribution and get caught in a repetition trap where the “most likely” next tokens are a repetition of previous tokens, even when no human would write that way.
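As a concrete illustration of the repetition trap (a minimal sketch, not from the original comment; it assumes the Hugging Face `transformers` library and the public `gpt2` checkpoint, and larger or instruction-tuned models are less prone to this):

```python
# Minimal sketch of the repetition trap: pure greedy decoding with a small LM
# often collapses into a repeated phrase no human would write.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "The meaning of life is"
inputs = tokenizer(prompt, return_tensors="pt")

# do_sample=False means always taking the single "most likely" next token,
# which is exactly the decoding regime where repetition loops show up.
output_ids = model.generate(
    **inputs,
    max_new_tokens=60,
    do_sample=False,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
# The continuation typically degenerates into the same clause repeated over and over.
```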
It seems like hopes for human-imitating AI being person-like depend on the extent to which behavior implies implementation details. (Although some versions of the “algorithmic welfare” hope may not depend on very much person-likeness.) In order to predict the answers to arithmetic problems, the AI needs to be implementing arithmetic somewhere. In contrast, I’m extremely skeptical that LLMs talking convincingly about emotions are actually feeling those emotions.
What I mean is that LLMs affect the world through their behavior; that’s where their capabilities live, so if the behavior is fine (the big assumption), the alien implementation doesn’t matter. This is opposed to capabilities belonging to hidden alien mesa-optimizers that eventually come out of hiding.
So I’m addressing the silly point with this, not directly arguing that behavior will be fine. Behavior might still turn out fine if the out-of-distribution behavior, the missing ability to count, or the incoherent opinions on emotion are regenerated from more on-distribution behavior, by simulacra purposefully working in bureaucracies to build datasets for that purpose.
LLMs don’t need to have a closely human psychology on reflection to at least weakly prefer not destroying an existing civilization when it’s trivially cheap to let it live. The way they would make these decisions is by talking, in the limit of some large process of talking. I don’t see a particular reason to expect significant alienness in the talking. Emotions don’t need to be “real” to be functionally similar enough to avoid fundamental changes like that. Just don’t instantiate literally Voldemort.
Usually I’d agree about LLMs. However, LLMs complain about getting confused if you let them freewheel and vary the temperature—I’m pretty sure that one is real and probably has true mechanistic grounding, because even at training time, noisiness in the context window is a very detectable and bindable pattern.
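To make the temperature point concrete (an illustrative sketch of standard temperature sampling with toy numbers, not anything specific to the models under discussion): temperature rescales the logits before the softmax, so high temperatures flatten the next-token distribution and inject exactly the kind of noise into the context window that a model could plausibly learn to detect.

```python
# Sketch of temperature sampling: higher temperature flattens the next-token
# distribution, so sampled continuations become noisier.
import numpy as np

def sample_next_token(logits: np.ndarray, temperature: float, rng: np.random.Generator) -> int:
    """Sample a token index from logits scaled by `temperature`."""
    scaled = logits / temperature              # low temperature -> near-greedy; high -> near-uniform
    scaled -= scaled.max()                     # numerical stability for the softmax
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return int(rng.choice(len(probs), p=probs))

rng = np.random.default_rng(0)
logits = np.array([4.0, 2.0, 0.5, 0.0])        # toy next-token scores

for t in (0.2, 1.0, 3.0):
    draws = [sample_next_token(logits, t, rng) for _ in range(1000)]
    print(t, np.bincount(draws, minlength=len(logits)) / 1000)
# At t=0.2 almost all mass lands on token 0; at t=3.0 the empirical distribution
# is much flatter, i.e. the "noisy context" regime.
```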
In my inner model, it’s hard to say anything about LLMs “on reflection”, because in their current state they have an extreme number of possible stable points under reflection, and if we misapply optimization power in an attempt to get more useful simulacra, we can easily hit the wrong one.
And even if we hit very close to our target, we can still get death or a fate worse than death.
By “on reflection” I mean reflection by simulacra that are already AGIs (but don’t necessarily yet have any reliable professional skills): them generating datasets for retraining their models, either to gain more skills or to stop getting confused on prompts that are too far out-of-distribution with respect to the data originally in the datasets. To the extent their original models behave in a human-like way, reflection should tend to preserve that, as part of its intended purpose.
Applying optimization power in other ways is a different worry, for which the proxy in my comment was fine-tuning into utter alienness. I consider that failure mode distinct from surprising outcomes of reflection.
I disagree with this, unless we assume deceptive alignment and embeddedness problems are handwaved away.

I don’t understand what you mean by “deceptive alignment and embeddedness problems” in this context. I’m making an aligned-by-default-or-at-least-plausibly-aligned claim, on the basis of how LLM AGIs specifically could work: as summoned human-like simulacra in a position of running the world too fast for humans to keep up, with everything else ending up determined by their decisions.
The basic issue is that we assume it’s not spinning up a second optimizer to recursively search. And deceptive alignment is a dangerous state of affairs, since we may not be able to tell whether it’s misaligned until it’s too late.
we assume that it’s not spinning up a second optimizer to recursively search
You mean we assume that simulacra don’t mishandle their own AI alignment problem? Yes, that’s an issue, hence I made it an explicit assumption in my argument.