I kinda agree with this as well, except that it seems completely unclear to me whether recreating the missing human capabilities/brain systems will take two years, two decades, or even longer.
It doesn’t seem to me to be a single missing thing, and for each separate step the same holds: the fact that it hasn’t been done yet is evidence that it’s not that easy.
A few years ago I had a similar idea, which I called Rawlsian Reinforcement Learning: provide scenarios like those in this post and evaluate the model’s actions according to how much each person in the scenario benefits from them. Then reinforce based on the mean benefit across all characters in the scenario (or some variation thereof), i.e. the reinforcement signal does not use the information about which character in the scenario is the model.
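To make the reward concrete, here is a minimal sketch of the computation I had in mind. The per-character benefit scores and the evaluator that would produce them are hypothetical; a more Rawlsian variant might take the minimum instead of the mean.

```python
import numpy as np

def rawlsian_reward(benefits):
    """Reward = mean benefit over all characters in the scenario.

    `benefits` is a hypothetical list of per-character benefit scores
    (e.g. produced by a separate evaluator). Crucially, it does not
    mark which character is the acting model, so the policy cannot be
    reinforced specifically for favoring itself.
    """
    return float(np.mean(benefits))
    # Maximin variant: return float(np.min(benefits))

# An action that helps the other characters but slightly costs the
# actor still receives a positive reward, since only the mean matters.
print(rawlsian_reward([0.8, 0.9, -0.1]))  # ~0.53
```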
Maybe I misunderstand your method, but it seems to me that you untrain the self-other distinction, which is ultimately a capability. So the model might not become more moral; instead, it might just lose the capacity to benefit itself because it can no longer distinguish between itself and others.