Mostly inspired by the comments in the post on Plans not being optimization.
When alignment researchers think about corrigibility and values, do they think of values more in terms of states, and corrigibility more as a process?
I mostly don’t think about corrigibility. But when I do, it’s generally to label departures from an AI having agenty structure. Some people like to think about corrigibility in terms of stimulus-response patterns, or behavioral guarantees, or as a grab-bag of doomed attempts to give orders to something smarter than you without understanding it. These are all fine too.
I definitely think about values more in terms of abstract states. By “abstract” I mean that states don’t have to be specific states of the universe’s quantum wavefunction; they can be anything that fills the role of “state” in a hierarchical set of models of the world.
It’s not that I’m hardcore committed to an AI never learning values that are about process. But I tend to think of even those in terms of state: as the AI having a model of itself and its own decision-making, with variables it can control like “how do I make decisions?” (or even ones as vague as “Am I being good and just?”).
Basically this is because I think that some state-based preferences are really important, and once you have those, deontological rules that have no grounding in state whatsoever are unnatural.
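To make the “values over abstract states” picture a bit more concrete, here’s a toy Python sketch (all names here are hypothetical, purely for illustration, not anyone’s actual proposal): abstract state variables from a hierarchical world model, plus the AI’s self-model treated as more state, get scored by a value function, so even “process-ish” preferences show up as preferences over state variables, while a rule with no grounding in state has nowhere to live in this structure.

```python
# Toy sketch only: "values over abstract states," where the AI's model of its
# own decision-making is itself part of the state being valued.
from dataclasses import dataclass, field

@dataclass
class AbstractState:
    """Variables at some level of a hierarchical world model (not raw physics)."""
    variables: dict = field(default_factory=dict)

@dataclass
class SelfModel(AbstractState):
    """The AI's model of its own decision-making, treated as more state,
    so 'process-ish' preferences become preferences over these variables."""

def value(world: AbstractState, self_model: SelfModel) -> float:
    """Toy value function: a weighted score over abstract state variables."""
    score = 0.0
    score += 1.0 * bool(world.variables.get("diamond_stays_in_vault"))
    score += 0.5 * bool(self_model.variables.get("decisions_are_transparent"))
    return score

# A deontological rule with no grounding in state would have to sit outside
# this scoring structure entirely, which is the sense in which it looks
# "unnatural" once state-based preferences are already doing the work.
print(value(AbstractState({"diamond_stays_in_vault": True}),
            SelfModel({"decisions_are_transparent": True})))  # -> 1.5
```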