A hypothesis based on this post:
Consider the subset of “human values” that we’d be “happy” (were we fully informed) for powerful systems to optimise for.
[Weaker version: “the subset of human values that it is existentially safe for powerful systems to optimise for”.]
Let’s call this subset “ideal values”.
I’d guess that the “most natural” abstraction of values isn’t “ideal values” themselves but something like “the minimal latents of ideal values”.
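For concreteness, here is a rough sketch of the “minimal latent” conditions as I understand them from the linked post; the notation (Λ for the latent, X_i for the observed variables) and the informal phrasing are mine, and may differ in detail from the original formalism.

```latex
% Sketch of the two "minimal latent" conditions (my paraphrase, not the post's exact formalism):
% 1) Mediation: conditioning on the latent \Lambda renders the observables independent.
P(X_1, \dots, X_n \mid \Lambda) \;=\; \prod_{i=1}^{n} P(X_i \mid \Lambda)
% 2) Minimality: any other latent \Lambda' satisfying (1) already determines \Lambda,
%    i.e. there exists some function f with \Lambda = f(\Lambda').
```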
Examples of what I mean by a concept being a “more natural” abstraction:
- The concept is highly privileged by the inductive biases of most learning algorithms that can efficiently learn our universe
  - More privileged → more natural
- Most efficient representations of our universe contain simple embeddings of the concept (a crude operationalisation is sketched after this list)
  - Simpler embeddings → more natural
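To make the second example slightly more concrete: one crude way to operationalise “the representation contains a simple embedding of the concept” is linear decodability. The sketch below is my illustration, not anything from the post; the “representations” and concept labels are synthetic stand-ins for activations of a trained model and ground-truth concept annotations.

```python
# A minimal sketch of "simple embedding" as linear decodability: if a
# low-capacity (here, linear) probe recovers the concept well from frozen
# representations, the concept is simply embedded in this crude sense.
# All data here is synthetic; nothing below comes from the original post.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in for frozen representations (e.g. activations of a trained network).
reps = rng.normal(size=(1000, 64))
# Stand-in concept that happens to be linearly encoded in the first 3 dims.
labels = reps[:, :3].sum(axis=1) > 0

# Fit a linear probe on a train split, evaluate on a held-out split.
probe = LogisticRegression(max_iter=1000).fit(reps[:800], labels[:800])
print("probe accuracy:", probe.score(reps[800:], labels[800:]))
```

High probe accuracy is only weak evidence of naturalness (a linear probe can pick up correlates rather than the concept itself), but it is the simplest version of the “simpler embeddings → more natural” test I can think of.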