I disagree to an extent. The examples provided seem to me to be examples of “being stupid”, which agents generally have an incentive to do something about, unless they’re too stupid for that to occur to them. That doesn’t mean that their underlying values will drift towards a basin of attraction.
The corrigibility thing is a basin of attraction specifically because a corrigible agent has preferences over itself and its future preferences. Humans do that too sometimes, but the examples provided are not that.
In general, I think you should expect dynamic preferences (cycles, attractors, chaos, etc.) anytime an agent has preferences over its own future preferences and the capability to modify its preferences.
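To make that last point a bit more concrete, here’s a toy sketch, not a model of any actual agent: I’m assuming a single “preference parameter” that the agent rewrites each step according to a meta-preference rule, with the logistic map standing in for that hypothetical rule. Depending on the rule’s parameter you get exactly the three regimes I mentioned: a fixed-point attractor, a cycle, or chaos.

```python
# Toy illustration only: theta in [0, 1] is a stand-in "preference parameter"
# that the agent rewrites each step according to a meta-preference rule f.
# The logistic map is a hypothetical choice of f, used because it cleanly
# shows attractors, cycles, and chaos as the rule's parameter r varies.

def update(theta: float, r: float) -> float:
    """One round of self-modification: next preferences = f(current preferences)."""
    return r * theta * (1.0 - theta)

def trajectory(theta0: float, r: float, steps: int = 60) -> list[float]:
    """Iterate the self-modification rule from an initial preference theta0."""
    thetas = [theta0]
    for _ in range(steps):
        thetas.append(update(thetas[-1], r))
    return thetas

if __name__ == "__main__":
    # r = 2.9: settles to a fixed point; r = 3.2: oscillates in a 2-cycle;
    # r = 3.9: chaotic, never settles.
    for r, label in [(2.9, "fixed point"), (3.2, "2-cycle"), (3.9, "chaotic")]:
        tail = trajectory(0.4, r)[-4:]
        print(f"r={r} ({label}): last values {[round(x, 3) for x in tail]}")
```

Running it prints the tail of each trajectory, so the three regimes are visible directly; the point is just that self-referential preference modification is a dynamical system, and dynamical systems generically do these things.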