I agree that predicting the answer to this question is hard. I’m just pointing out that the initial distribution for a base-model LLM is predictably close to human behavior on the Internet and in books (which is often worse than behavior in real life), but that this could get modified a lot in the process of turning a base-model LLM into an AGI agent.
Still, I don’t think 0 niceness is the median expectation: the base model inherits some niceness from humans via the distillation-like process of training it. That’s a noticeable difference from what people on LessWrong/at MIRI thought, say, a decade ago, when the default assumption was that AGI would be trained primarily with RL, and RL-induced powerseeking seemed likely to produce ~0 niceness by default.
The bigger difference is how much LessWrong/MIRI got human value formation and the complexity of human values wrong, but that’s a very different discussion, so I’ll leave it as a comment rather than a post here.