Base model LLMs are trained on human data. So by default they generate a prompt-dependent distribution of simulated human behavior, with roughly the same range of kindness as can be found on the Internet, in books, etc., which is a pretty wide range.
For instruct-trained models, RLHF for helpfulness and harmlessness seems likely to increase kindness, and, superficially, as applied to current foundation models it appears to do so. RL with many other objectives could induce powerseeking and thus could reasonably be expected to decrease kindness. Prompting can of course have a wide range of effects.
So if we build an AGI based around an agentified fine-tuned LLM, the default level of kindness is probably within an order of magnitude of that of humans (who, for example, build nature reserves). A range of known methods seem likely to modify that significantly, up or down.
as applied to current foundation models it appears to do so
I don’t think the outputs of RLHF’d LLMs have the same mapping to the internal cognition which generated them that human behavior does to the human cognition which generated it. (That is to say, I do not think LLMs behave in ways that look kind because they have a preference to be kind, since right now I don’t think they meaningfully have preferences in that sense at all.)
I very much agree with you that we should be analyzing the question in terms of the type of AGI we’re most likely to build first, which is agentized LLMs or something else that learns a lot from human language.
I disagree that we can easily predict “niceness” of the resulting ASI based on the base LLM being very “nice”. See my answer to this question.
I agree that predicting the answer to this question is hard. I’m just pointing out that the initial distribution for a base model LLM is predictably close to human behavior on the Internet and in books (which is often worse than behavior in real life), but that this could get modified a lot in the process of turning a base-model LLM into an AGI agent.
Still, I don’t think 0 niceness is the median expectation: the base model inherits some niceness from humans via the distillation-like process of training it. That’s a noticeable difference from what people on LessWrong/at MIRI thought, say, a decade ago, when the default assumption was that AGI would be trained primarily with RL, and RL-induced powerseeking seemed likely, by default, to produce ~0 niceness.
The bigger difference is how much LessWrong/MIRI got human value formation and the complexity of human values wrong, but that’s a very different discussion, so I’ll leave it as a comment rather than a post here.