This comment changed my mind on the probability that evolved aliens are likely to end up kind, which I now think is somewhat more likely than 5%. I still think AI systems are unlikely to have kindness, for something like the reason you give at the end:
In ML we just keep on optimizing as the system gets smart. I think this doesn’t really work unless being kind is a competitive disadvantage for ML systems on the training distribution.
I actually think it’s somewhat likely that ML systems won’t value kindness at all before they are superhuman enough to take over. I expect kindness as a value within the system itself not to arise spontaneously during training, and that no one will succeed at eliciting it deliberately before take over. (The outward behavior of the system may appear to be kind, and mechanistic interpretability may show that some internal component of the system has a correct understanding of kindness. But that’s not the same as the system itself valuing kindness the way that humans do or aliens might.)
This comment changed my mind on the probability that evolved aliens are likely to end up kind, which I now think is somewhat more likely than 5%. I still think AI systems are unlikely to have kindness, for something like the reason you give at the end:
I actually think it’s somewhat likely that ML systems won’t value kindness at all before they are superhuman enough to take over. I expect kindness as a value within the system itself not to arise spontaneously during training, and that no one will succeed at eliciting it deliberately before take over. (The outward behavior of the system may appear to be kind, and mechanistic interpretability may show that some internal component of the system has a correct understanding of kindness. But that’s not the same as the system itself valuing kindness the way that humans do or aliens might.)