Well, for starters, it narrows down the kind of type signature you might need to look for to find something like a “desire” inside an AI, if the training dynamics described here are broad enough to hold for the AI too.
It also helped me become less confused about what the “human values” we want the AI to be aligned with might actually look like, mechanistically, in our own brains. That seems useful for, e.g., schemes where you try to rewire the AI to have a goal given by a pointer to its model of human values. I imagine having a better idea of what you’re actually aiming for would also be useful for many other alignment schemes.