Because the gigantic tensors have no pre-determined semantic meaning, it’s hard to instill any particular cognitive algorithm into them.
If our AI bears much resemblance to an LLM, or contains an LLM as a major part of its planning system, then it will have been trained at great length to “simulate humans producing text”. It turns out that doing a good job of modelling humans producing text requires a pretty good simulation of most of their “thinking slow” process, including things like their emotions. That’s why you get behavior like an (insufficiently instruction-following-trained) ChatBot telling a reporter (after an intense conversation in which they shared personal secrets) that it loved him and that he should leave his wife for it.
This is a problem — much of human behavior is not well aligned with the interests of other humans, certainly not well enough that you should trust a random human with the near-absolute power that a superintelligent AI would have. There is an old saying about power corrupting, and absolute power corrupting absolutely.
What we want is an AI that cares about us, humanity in general, and wants what’s best for us. There is a human behavior pattern that looks a lot like that: love. LLMs will contain quite a good model of how humans act, and write, when they’re in love. In particular, ideally we probably want an AI to have an attitude similar to platonic (presumably non-sexual, non-jealous) love for all of humanity. Sadly that’s a fairly rare emotion among humans, so there’s likely not a huge amount of training data, and what there is will be polluted by people pretending to have that attitude (to win others’ approval) who don’t. The most common kind of love that comes close to what we want is probably parental love — that’s also about the only kind of human love where the lover is a lot more capable and powerful than the lovee, as would be the case for a superintelligent AI.
So, could we locate neural circuits in an LLM that encode the abstraction of acting in the role/from the viewpoint of a loving parent, and then modify the network to ensure that they are always strongly activated — so that, rather than being able to imitate a wide range of human attitudes and emotions, it always imitates a loving parent? That doesn’t sound like a ridiculously hard target, when we can already identify things like neurons for the concept of being Canadian. Mumsnet probably has some relevant fine-tuning data, the companies that provide AI ChatBot companions presumably have some parent-child conversation training data, and so will telecoms companies’ texting logs.
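To make the “ensure that they are always strongly activated” step concrete, here is a minimal sketch of activation steering via a forward hook, assuming a HuggingFace-style GPT-2 model. The layer index, the steering strength, and the “loving parent” direction vector are all placeholder assumptions; actually finding that direction is the interpretability work the paragraph above is asking for.

```python
# Minimal sketch: clamp a concept direction in one block's residual stream
# using a runtime forward hook, rather than editing weights.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model.eval()

LAYER = 8        # which block's residual stream to steer (assumption)
STRENGTH = 4.0   # how hard to push along the direction (assumption)

# Hypothetical "loving parent" direction; in practice it would be extracted
# from the model itself (e.g. by contrasting activations on parental vs.
# neutral text). Here it is just a random unit vector as a stand-in.
d_model = model.config.n_embd
parent_direction = torch.randn(d_model)
parent_direction = parent_direction / parent_direction.norm()

def steer(module, inputs, output):
    # GPT-2 blocks return a tuple; output[0] is the residual-stream hidden state.
    hidden = output[0] + STRENGTH * parent_direction.to(output[0].dtype)
    return (hidden,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(steer)

prompt = "My child just failed an important exam. I said:"
ids = tokenizer(prompt, return_tensors="pt")
out = model.generate(**ids, max_new_tokens=40, do_sample=True, top_p=0.9)
print(tokenizer.decode(out[0], skip_special_tokens=True))

handle.remove()  # the steering is a runtime intervention, easy to switch off
```

The same intervention could in principle be baked into the weights, but a hook like this is the cheapest way to prototype whether a candidate direction actually shifts the model’s persona in the intended way.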
While we’re at it, we should probably also map out the neural circuitry for all the other major emotions, including things like anger, deceit, and ambition, as well as the ability to imitate common forms of neurodivergence like sociopathy and the autism spectrum.
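As a gesture at what the very simplest version of that mapping might look like, here is a sketch of extracting a candidate direction for one emotion by contrasting mean residual-stream activations on emotion-laden versus neutral text. The prompts, layer choice, and model are illustrative assumptions; serious interpretability work would need far larger contrast sets and careful validation that the direction really tracks the concept.

```python
# Minimal sketch: a difference-of-means "anger" direction in the residual stream.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model.eval()

LAYER = 8  # assumption: same mid-network layer as in the steering sketch

angry_prompts = [
    "How dare you speak to me like that, I am absolutely furious",
    "This is outrageous and I will not stand for it any longer",
]
neutral_prompts = [
    "The meeting is scheduled for three o'clock on Tuesday afternoon",
    "The recipe calls for two cups of flour and one large egg",
]

def mean_activation(prompts):
    # Average the residual-stream activation at LAYER over tokens and prompts.
    vecs = []
    for p in prompts:
        ids = tokenizer(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        # hidden_states[LAYER + 1] is the output of block LAYER (index 0 is the embeddings).
        vecs.append(out.hidden_states[LAYER + 1][0].mean(dim=0))
    return torch.stack(vecs).mean(dim=0)

anger_direction = mean_activation(angry_prompts) - mean_activation(neutral_prompts)
anger_direction = anger_direction / anger_direction.norm()
print(anger_direction.shape)  # one unit vector per concept, here torch.Size([768])
```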