I very much agree with you that we should be analyzing the question in terms of the type of AGI we’re most likely to build first, which is agentized LLMs or something else that learns a lot from human language.
I disagree that we can easily predict “niceness” of the resulting ASI based on the base LLM being very “nice”. See my answer to this question.
I agree that predicting the answer to this question is hard. I’m just pointing out that the initial distribution for a base-model LLM is predictably close to human behavior on the Internet/in books (which is often worse than behavior in real life), but that this could get modified a lot in the process of turning a base-model LLM into an AGI agent.
Still, I don’t think 0 niceness is the median expectation: the base model inherits some niceness from humans via the distillation-like process of training it. That’s a noticeable difference from what people on LessWrong/at MIRI thought, say, a decade ago, when the default assumption was that AGI would be trained primarily with RL, and RL-induced power-seeking seemed likely, by default, to produce ~0 niceness.
The bigger difference is how much LessWrong/MIRI got human value formation and the complexity of human values wrong, but that’s a very different discussion, so I’ll leave it as a comment rather than a post here.