I claim that the way to properly solve embedded agency is to do abstract agent foundations such that embedded agency falls out naturally as one adds an embedding.
In the abstract, an agent doesn’t terminally care to use an ability to modify its utility function.
Suppose a clique of spherical children in a vacuum [edit: …pictured on the right] found each other by selecting for utility functions that agree on every situation considered so far. They invest in their ability to work together, as nature incentivizes them to do.
They face a coordination problem: as they encounter new situations, they may discover disagreements. Thus, they agree to shift their utility functions precisely in the direction of satisfying whatever each other's preferences turn out to be.
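A minimal toy sketch of that mechanism in Python; the agent names, the situations, and the midpoint update rule are illustrative assumptions on my part, not a worked-out model:

```python
# Toy sketch: two agents whose utility functions agree on every situation
# encountered so far, and who resolve a newly discovered disagreement by
# each shifting their utility toward the other's.

class Agent:
    def __init__(self, name, utilities):
        self.name = name
        self.utilities = dict(utilities)  # situation -> utility value

    def utility(self, situation, default=0.0):
        return self.utilities.get(situation, default)


def reconcile(a, b, situation, step=1.0):
    """Shift both agents' utilities on `situation` toward each other.

    With step=1.0 both land on the midpoint, so they agree exactly;
    a smaller step models a partial concession.
    """
    ua, ub = a.utility(situation), b.utility(situation)
    a.utilities[situation] = ua + step * (ub - ua) / 2
    b.utilities[situation] = ub + step * (ua - ub) / 2


# The clique starts out agreeing on everything seen so far ...
alice = Agent("alice", {"share_food": 1.0, "build_shelter": 0.75})
bob = Agent("bob", {"share_food": 1.0, "build_shelter": 0.75})

# ... then a new situation exposes a disagreement, and they reconcile.
alice.utilities["explore_cave"] = 1.0
bob.utilities["explore_cave"] = 0.5
reconcile(alice, bob, "explore_cave")
assert alice.utility("explore_cave") == bob.utility("explore_cave") == 0.75
```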
This is the simplest case I have yet seen where alignment as a concept falls out explicitly. It smells like it fails to scale in any number of ways, which is worrisome for our prospects. Another point against trying to build a utility maximizer.