A hypothesis based on this post:
Consider the subset of “human values” that we’d be “happy” (were we fully informed) for powerful systems to optimise for.
[Weaker version: “the subset of human values that it is existentially safe for powerful systems to optimise for”.]
Let’s call this subset “ideal values”.
I’d guess that the “most natural” abstraction of values isn’t “ideal values” themselves but something like “the minimal latents of ideal values”.
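For concreteness, here is a rough sketch of the “minimal latent” conditions as I understand them from the linked post; the notation (Λ for the latent, X_i for the observed variables) and the informal phrasing are mine, and may differ in detail from the original formalism.

```latex
% Sketch of the two "minimal latent" conditions (my paraphrase, not the post's exact formalism):
% 1) Mediation: conditioning on the latent \Lambda renders the observables independent.
P(X_1, \dots, X_n \mid \Lambda) \;=\; \prod_{i=1}^{n} P(X_i \mid \Lambda)
% 2) Minimality: any other latent \Lambda' satisfying (1) already determines \Lambda,
%    i.e. there exists some function f with \Lambda = f(\Lambda').
```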
Examples of what I mean by a concept being a “more natural” abstraction:
- The concept is highly privileged by the inductive biases of most learning algorithms that can efficiently learn our universe
  - More privileged → more natural
- Most efficient representations of our universe contain simple embeddings of the concept (a crude operationalisation is sketched after this list)
  - Simpler embeddings → more natural
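To make the second example slightly more concrete: one crude way to operationalise “the representation contains a simple embedding of the concept” is linear decodability. The sketch below is my illustration, not anything from the post; the “representations” and concept labels are synthetic stand-ins for activations of a trained model and ground-truth concept annotations.

```python
# A minimal sketch of "simple embedding" as linear decodability: if a
# low-capacity (here, linear) probe recovers the concept well from frozen
# representations, the concept is simply embedded in this crude sense.
# All data here is synthetic; nothing below comes from the original post.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Stand-in for frozen representations (e.g. activations of a trained network).
reps = rng.normal(size=(1000, 64))
# Stand-in concept that happens to be linearly encoded in the first 3 dims.
labels = reps[:, :3].sum(axis=1) > 0

# Fit a linear probe on a train split, evaluate on a held-out split.
probe = LogisticRegression(max_iter=1000).fit(reps[:800], labels[:800])
print("probe accuracy:", probe.score(reps[800:], labels[800:]))
```

High probe accuracy is only weak evidence of naturalness (a linear probe can pick up correlates rather than the concept itself), but it is the simplest version of the “simpler embeddings → more natural” test I can think of.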