Great post; I suspect this to be one of the most tractable areas for reducing s-risks!
I also like the appendix. Perhaps this is too obvious to state explicitly, but I think one reason spite in AIs seems like a real concern is that it did evolve somewhat prominently in (some) humans. And since I'm sympathetic to the shard theory approach to understanding AI motivations, the human evolutionary example seems relevant to me. (That said, the analogy only holds if there's a multi-agent-competition phase in training, which isn't the case with LLM training so far, for instance.)