I agree that predicting the answer to this question is hard. I’m just pointing out that the initial distribution for a base-model LLM is predictably close to human behavior on the Internet and in books (which is often worse than behavior in real life), but that this could get modified a lot in the process of turning a base-model LLM into an AGI agent.
Still, I don’t think 0 niceness is the median expectation: the base model inherits some niceness from humans via the distillation-like process of training it. That’s a noticeable difference from what people on LessWrong/at MIRI thought, say, a decade ago, when the default assumption was that AGI would be trained primarily with RL, and RL-induced powerseeking seemed likely to produce ~0 niceness by default.
The bigger difference is how much LessWrong/MIRI got human value formation and the complexity of human values wrong, but that’s a very different discussion, so I’ll leave it as a comment rather than a post here.