I really think the best arguments for and against AIs being slightly nice are almost entirely different from the ones in that thread.
That discussion addresses all of mind-space. We can do much better if we address the corner of mind-space that’s relevant: the types of AGIs we’re likely to build first.
Those are pretty likely to be based on LLMs, and even more likely to learn a lot from human language (since it distills useful information about the world so nicely). That encodes a good deal of “niceness”. They’re also very likely to include RLHF/RLAIF or something similar, which makes current LLMs sort of absurdly nice.
Does that mean we’ll get aligned or “very nice” AGI by default? I don’t think so. But it does raise the odds substantially that we’ll get a slightly nice AGI even if we almost completely screw up alignment.
The key issue in whether an autonomous mind with those starting influences winds up being “nice” is the alignment stability problem. This has been little addressed outside of reflective stability. It’s pretty clear that the most important goal will be reflectively stable; it’s pretty much part of the definition of having a goal that you don’t want it to change before you achieve it. It’s much less clear what the stable equilibrium is in a mind with a complex set of goals. Humans don’t live long enough to reach a stable equilibrium. AGIs with preferences encoded in deep networks may reach equilibrium rather quickly.
What equilibrium they reach probably depends on how they make decisions about updating their beliefs and goals. I’ve had a messy rough draft on this for years, and I’m hoping to post a short version. But it doesn’t have answers; it just tries to clarify the question and argue that it deserves a bunch more thought.
The other perspective is that it’s pretty unlikely that such a mind will reach an equilibrium autonomously. I’m pretty sure that Instruction-following AGI is easier and more likely than value aligned AGI, so we’ll probably have at least some human intervention on the trajectory of those minds before they become fully autonomous. That could also raise the odds of some accidental “niceness” even if we don’t successfully put them on a trajectory for full value alignment before they are granted or achieve autonomy.