An empathy band-aid for immediate AI catastrophe
Assuming that an AGI agent is being built with Q-learning and an LLM as the world model, we can do the following to emulate empathy:
Train a separate agent-action reasoning network. For LLM tech this means training on completing interaction sentences, think “Alice pushed Bob. ___ fell due to ___”, with a tokenizer that generalizes agents (Alice and Bob) into generic {agent 1, …, agent n} tokens plus a “self agent” token. We then replace the various Alices and Bobs in action sentences with these generic agent tokens and train the network to guess consequences or prerequisites of various actions in real situations, which you can get from any text corpus.
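A minimal sketch of what that preprocessing could look like, assuming a simple string-level anonymizer standing in for the tokenizer-level generalization the post describes; the `<agent_k>` token names and the helper functions are illustrative assumptions, not a fixed spec:

```python
import re

def anonymize_agents(sentence: str, agent_names: list[str]) -> str:
    """Replace each named agent with a generic <agent_k> token."""
    out = sentence
    for k, name in enumerate(agent_names, start=1):
        out = re.sub(rf"\b{re.escape(name)}\b", f"<agent_{k}>", out)
    return out

def make_cloze_example(sentence: str, cloze: str, agent_names: list[str]) -> str:
    """Build a consequence-guessing training prompt from a real-corpus sentence."""
    return anonymize_agents(f"{sentence} {cloze}", agent_names)

print(make_cloze_example("Alice pushed Bob.", "___ fell due to ___.", ["Alice", "Bob"]))
# -> "<agent_1> pushed <agent_2>. ___ fell due to ___."
```

The point of the anonymization is that the network learns action-consequence patterns over interchangeable agent slots rather than over specific named individuals.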
Prune anything that has to do with agent reasoning from the parent LLM. Any such reasoning has to go through the agent-reasoning network.
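One way to picture the routing constraint, under heavy assumptions about the interfaces: `agent_reasoning_net`, `world_model_llm`, and the agent-token check below are hypothetical placeholders, and a real system would need an actual mechanism for pruning agent reasoning out of the parent model rather than just dispatching around it.

```python
from typing import Callable

AGENT_TOKENS = {"<agent_1>", "<agent_2>", "<self_agent>"}  # assumed token vocabulary

def route_query(
    tokens: list[str],
    agent_reasoning_net: Callable[[list[str]], str],
    world_model_llm: Callable[[list[str]], str],
) -> str:
    """Send anything that touches agent reasoning to the dedicated network."""
    if any(t in AGENT_TOKENS for t in tokens):
        # Pruned path: the parent LLM never reasons about agents directly.
        return agent_reasoning_net(tokens)
    return world_model_llm(tokens)
```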
Whenever the Q-learning cost is computed, we replace the generic agent tokens in the agent-reasoning net's input with the “self” token, rerun the network, and take the higher of the two costs. The agent is thereby forced to treat harm to anyone else as harm to itself.
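A sketch of that substitution step, assuming the Q-learning cost can be evaluated as a function of the generic-tokenized sequence fed to the agent-reasoning net; `cost_estimate`, `SELF_TOKEN`, and the function name are placeholders for whatever the real system would expose.

```python
from typing import Callable, Sequence

SELF_TOKEN = "<self_agent>"

def empathic_cost(
    tokens: list[str],
    other_agent_tokens: Sequence[str],
    cost_estimate: Callable[[list[str]], float],
) -> float:
    """Worst cost over the original sequence and every 'what if that were me?' variant."""
    worst = cost_estimate(tokens)
    for agent in other_agent_tokens:
        swapped = [SELF_TOKEN if t == agent else t for t in tokens]
        # Harm predicted for any other agent is scored as if it were harm to the self agent.
        worst = max(worst, cost_estimate(swapped))
    return worst
```

With a single other agent this is exactly the “take the higher of the two costs” rule; with several agents it generalizes to a max over all of the self-substitutions.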
As I say, this is emulated empathy. It doesn’t solve alignment, obviously, but it would help a bit against the immediate danger. If people are going to be stupid and create agents, we can at least make sure those agents aren’t electronic psychopaths. This solution assumes a lot about how the first AGI will be built, but I believe my assumptions point in the right direction. I make this judgement from looking into proposals such as JEPA and from my general understanding of the human mind.
Now, I’m a nobody in this field, but I haven’t heard this approach being discussed, so I’m hoping someone with actual media presence (Eliezer) can get the idea over to the people developing AI. If this was internally suggested at OpenAI/Anthropic/MS/Google, good; if not, it’s paramount I get it over to them. Any suggestions on how to contact important people in the field, and actually get them to pay attention to some literal who, are welcome. Also feel free to rephrase my idea and disseminate it yourself.
If the first AGI isn’t created with some form of friendliness baked into the architecture, we’ll all be dead before the 2030s, so I believe anything that can help is of infinite importance. And it’s evident that people are trying to create AGI right now.
Mod note, in the spirit of our experiment in more involved moderation:
This seems a bit more grounded than many first posts about AI on LessWrong, but my guess is it doesn’t meet the quality bar. See the advice in the AI Takes section of my linked comment.
Seems like a reasonable take. It’s been discussed in papers already. I’ll put some resources here in a few hours.