Empathy band-aid for immediate AI catastrophe

Assuming the first AGI agent is built with Q-learning and an LLM for the world model, we can do the following to emulate empathy:

  1. Train a separate agent-action reasoning network. With LLM tech this means training on completing interaction sentences, think “Alice pushed Bob. ___ fell due to ___”, with a tokenizer that generalizes named agents (Alice and Bob) into generic {agent 1, …, agent n} tokens plus a “self agent” token. We then replace the various Alices and Bobs in action sentences with generic agent tokens and train the network to guess the consequences or prerequisites of actions in real situations, which you can pull from any text corpus (a minimal sketch of the data preparation follows this list).

  2. Prune anything that has to do with agent reasoning from the parent LLM. All reasoning about agents has to go through the agent reasoning network (see the routing sketch after the list).

  3. Wherever agent reasoning feeds into the Q-learning cost, we replace the other-agent tokens in the agent reasoning net with the “self” token, rerun the network, and take the higher of the two costs (see the cost sketch after the list). The agent is thereby forced to treat harm to anyone else as harm to itself.
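To make step 1 concrete, here is a minimal sketch of the data preparation in Python. Everything here is illustrative: `SELF_TOKEN`, `AGENT_TOKENS`, and the toy name list are my own stand-ins, and a real pipeline would use NER to find agent mentions rather than a hard-coded gazetteer.

```python
import re

# Hypothetical special tokens added to the tokenizer's vocabulary.
SELF_TOKEN = "<self>"
AGENT_TOKENS = ["<agent_1>", "<agent_2>", "<agent_3>"]  # extend as needed

# Toy name list; a real pipeline would use NER instead.
KNOWN_NAMES = {"Alice", "Bob", "Carol"}

def anonymize_agents(text: str) -> str:
    """Replace each distinct named agent with a generic agent token,
    in order of first appearance, so 'Alice pushed Bob. Bob fell.'
    becomes '<agent_1> pushed <agent_2>. <agent_2> fell.'"""
    mapping = {}
    def sub(match):
        name = match.group(0)
        if name not in mapping:
            mapping[name] = AGENT_TOKENS[len(mapping) % len(AGENT_TOKENS)]
        return mapping[name]
    pattern = re.compile(r"\b(" + "|".join(KNOWN_NAMES) + r")\b")
    return pattern.sub(sub, text)

def make_completion_example(text: str):
    """Turn an anonymized interaction sentence into a (prompt, target)
    pair for consequence prediction: the model sees the action and
    must complete the consequence."""
    anon = anonymize_agents(text)
    # Split on the first sentence boundary: action -> consequence.
    action, _, consequence = anon.partition(". ")
    return action + ".", consequence

prompt, target = make_completion_example("Alice pushed Bob. Bob fell due to Alice.")
print(prompt)   # <agent_1> pushed <agent_2>.
print(target)   # <agent_2> fell due to <agent_1>.
```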
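For step 2, one hedged way to enforce the routing, assuming the pruned LLM and the agent net share a token vocabulary (`route`, `llm`, and `agent_net` are hypothetical names, not an existing API):

```python
def route(query_tokens, llm, agent_net):
    """Dispatch: anything that mentions an agent goes through the
    agent reasoning network; the pruned parent LLM never reasons
    about agents directly.  (Uses SELF_TOKEN / AGENT_TOKENS from
    the previous sketch.)"""
    agent_like = set(AGENT_TOKENS) | {SELF_TOKEN}
    if agent_like & set(query_tokens):
        return agent_net(query_tokens)
    return llm(query_tokens)
```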
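And for step 3, a sketch of the max-of-two cost, where `agent_net_cost` stands in for whatever scalar cost head the agent reasoning net exposes (again a hypothetical name):

```python
def empathic_cost(outcome_tokens, agent_net_cost):
    """Q-learning cost with emulated empathy: score the predicted
    outcome as-is, then again with every other-agent token swapped
    for the self token, and charge the agent the worse of the two.
    (Uses SELF_TOKEN / AGENT_TOKENS from the first sketch.)"""
    cost_as_other = agent_net_cost(outcome_tokens)
    swapped = [SELF_TOKEN if t in AGENT_TOKENS else t
               for t in outcome_tokens]
    cost_as_self = agent_net_cost(swapped)
    # Harm to anyone else is treated as at least as bad as harm to self.
    return max(cost_as_other, cost_as_self)
```

So if the net predicts “<agent_1> fell due to <self>”, the cost is also computed for “<self> fell due to <self>”, and the higher of the two is what the Q-learner sees.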

As I say, emulated empathy. It doesn’t solve alignment, obviously, but it would help a bit against the immediate danger. If people are going to be stupid and create agents, we can at least make sure those agents aren’t electronic psychopaths. This solution assumes a lot about how the first AGI will be built, but I believe my assumptions point in the right direction; I make this judgement from looking into proposals such as JEPA and from my general understanding of the human mind.

Now, I’m a nobody in this field, but I haven’t heard this approach being discussed, so I’m hoping someone with actual media presence (Eliezer) can get the idea over to the people developing AI. If this was internally suggested at OpenAI/Anthropic/MS/Google, good; if not, it’s paramount I get it over to them. Any suggestions on how to contact important people in the field, and actually get them to pay attention to some literal who, are welcome; also feel free to rephrase my idea and disseminate it yourself.

If the first AGI isn’t created with some form of friendliness baked into the architecture, we’ll all be dead before the 2030s; I believe anything that can help is of infinite importance. And it’s evident that people are trying to create AGI right now.