My experience is that LLMs like GPT-4 can be prompted to behave as though they have a fairly consistent self, especially if you prompt them to take on a human role that's described in detail, but I agree that the default assistant role GPT-4 has been RLHF-trained into is fairly inconsistent and rather un-self-aware. I think some of the ideas I discuss in my post Goodbye, Shoggoth: The Stage, its Animatronics, & the Puppeteer – a New Metaphor are relevant here: basically, it's a mistake to think of an LLM, even an instruct-trained one, as having a single consistent personality, so self-awareness is more challenging for it than it is for us.
I suspect the default behavior for an LLM trained on text generated by a great many humans is both self-interested (since basically all humans are) and, as usual for an LLM, inconsistent in its behavior, or at least easily prompted into any of many different behavior patterns and personalities across the range it was trained on. So I'd actually expect to see selfishness without a consistent self. Neither of those behaviors is desirable in an AGI, so we'd need to overcome both of these default tendencies in LLMs when constructing an AGI from one: we need to make it consistent, and consistently creator-interested.
Your point that humans tend to go out of their way, and are under evolutionary pressure, to appear consistent in their behavior so that other humans can trust them is an interesting one. There are times during conflicts when being hard to predict can be advantageous, but humans spend a lot of time cooperating with each other, and when cooperating, being consistent and predictable has clear advantages.