My best guess is that if we pretend we knew how to define a space where AIs that are similar under self-modification are close together, there would indeed be basins of attraction around most good points (AIs that do good things with the galaxy). However, I see no particular reason why there should be only one such basin of attraction, at least not without defining your space in an unnatural way. And of course there are going to be plenty of other basins of attraction; you don't get alignment by default just by throwing a dart into AI-space.
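To make the "multiple basins" point concrete, here is a deliberately toy sketch (my own illustration, not a model of real AI systems): treat the hypothetical AI-space as one dimension, and model repeated self-modification as gradient descent on a made-up double-well potential. Different starting points settle at different fixed points, i.e. there is more than one basin of attraction:

```python
# Toy illustration only: a 1-D stand-in for "AI-space", with repeated
# self-modification modelled as gradient descent on a double-well
# potential V(x) = (x^2 - 1)^2. The two minima at x = -1 and x = +1
# play the role of two distinct attractors.

def self_modify(x, step=0.05):
    grad = 4 * x * (x**2 - 1)  # dV/dx for the double-well potential
    return x - step * grad     # one "self-modification" step

for x0 in (-1.7, -0.3, 0.4, 1.9):
    x = x0
    for _ in range(200):
        x = self_modify(x)
    print(f"start at {x0:+.1f} -> settles near {x:+.2f}")
```

Which attractor you end up in depends entirely on where you start; nothing in the dynamics singles out one basin as "the" basin, which is the sense in which a random dart into AI-space doesn't get you alignment by default.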
A load-bearing claim of the robust values hypothesis for "alignment by default" is #2:
Said subset is a "naturalish" abstraction:
- The more natural the abstraction, the more robust values are
- Example operationalisations of "naturalish abstraction":
  - The subset is highly privileged by the inductive biases of most learning algorithms that can efficiently learn our universe
    - More privileged → more natural
  - Most efficient representations of our universe contain a simple embedding of the subset
    - Simpler embeddings → more natural
The safety comes from #3 and #1, but #2 is why we're not throwing a dart at random into AI-space. It's a property that makes value learning easier.
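As a rough illustration of what "privileged by the inductive biases" can mean in the most generic ML sense (my own toy example, not drawn from the linked posts): when several hypotheses fit the same observations about equally well, a learner with a simplicity bias systematically prefers the simplest one. The linear data-generating rule and the complexity penalty below are made up for the sketch:

```python
# Toy sketch of an inductive bias "privileging" a simple hypothesis.
# Data comes from a simple linear rule plus noise; candidate hypotheses
# are polynomials of increasing degree; the score adds a crude
# complexity penalty (a stand-in for a simplicity prior / MDL term).
import numpy as np

rng = np.random.default_rng(0)
xs = np.linspace(-1.0, 1.0, 12)
ys = 2.0 * xs + 0.05 * rng.normal(size=xs.size)  # simple underlying rule

for degree in (1, 3, 5, 7):
    coeffs = np.polyfit(xs, ys, degree)
    fit_error = float(np.mean((np.polyval(coeffs, xs) - ys) ** 2))
    score = fit_error + 0.01 * (degree + 1)      # fit + simplicity penalty
    print(f"degree {degree}: fit error {fit_error:.4f}, penalized score {score:.4f}")
```

The higher-degree fits buy essentially no extra accuracy but pay the complexity penalty, so the degree-1 hypothesis wins. The hope behind #2, as I read it, is that the relevant subset of human values is "simple" for realistic learners in a loosely analogous sense.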
Sure. Though see Take 4.
Claim #1 (about a "privileged subset") is a claim that there aren't multiple such natural abstractions (e.g. any other subset of human values that satisfies #3 would be a superset of the privileged subset, or a subset of the basin of attraction around the privileged subset).
[But I haven’t yet fully read that post or your other linked posts.]