Upvoted, nice original thoughts in this post IMO!

> If the AI forms the abstraction “diamond” itself, we could just point at that abstraction in the AI’s mind and say “maximize that one”, instead of trying to rigorously formulate what a diamond is. This was proposed in combination with shard theory to solve the diamond maximization problem. If the AI would naturally form an abstraction of human values, alignment might be easier than we thought (alignment by default): we could just point at that abstraction and program the AI to adhere to those values.

A shot at the diamond-alignment problem does indeed rely on the formation of a diamond abstraction in the policy network. Two points:
1. I was purposefully not tackling the diamond maximization problem, and briefly noted that[1] in an early paragraph of the post (though I should probably state that more clearly). I think “make a crisp maximizer for a simple-seeming quantity” is another fatal assumption made by “classic” theory.
2. Very few people seem to note that there’s an important difference. In his critique, Nate Soares seemed to write that I was trying to solve diamond maximization. I have reminded people about five times this week alone that the post was not about diamond maximization. I should probably update the story.
My story doesn’t involve the designers surgically modifying the network to care about the diamond abstraction, which is what I read you as implying here. (Are you? EDIT: after reading further, I think you didn’t mean that. Maybe worth clarifying the quoted portion?)
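To make the quoted “point at that abstraction and maximize it” proposal concrete, here is a minimal illustrative sketch, not taken from the post or the comments: it assumes we have already located a hypothetical `diamond_direction` in a policy network’s hidden activations (both the network and the direction are made-up stand-ins) and then treats the strength of that internal abstraction as the thing to optimize.

```python
import torch

# Illustrative stand-ins: a small policy network and a direction in its hidden
# space that we *assume* encodes the "diamond" abstraction.
hidden_dim = 64
policy_net = torch.nn.Sequential(
    torch.nn.Linear(16, hidden_dim),
    torch.nn.ReLU(),
    torch.nn.Linear(hidden_dim, hidden_dim),
)
diamond_direction = torch.randn(hidden_dim)  # hypothetical probed direction
diamond_direction = diamond_direction / diamond_direction.norm()

def diamond_abstraction_score(observation: torch.Tensor) -> torch.Tensor:
    """How strongly the (assumed) internal diamond abstraction fires on this input."""
    hidden = policy_net(observation)
    return hidden @ diamond_direction  # projection onto the assumed direction

# "Maximize that one": use the abstraction's activation itself as the objective,
# instead of hand-writing a rigorous definition of what a diamond is.
obs = torch.randn(16, requires_grad=True)
diamond_abstraction_score(obs).backward()  # gradients of the internal score
```

The sketch also makes the contrast above visible: “surgically” wiring the objective to an internal feature is a different operation from rewarding behavior and letting a diamond-favoring shard form on its own.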
> Notice how the first might be true, while the second might be completely off. Just as you can deny that Newtonian mechanics is true, without denying that heavy objects attract each other.
Nitpick: I agree with the distinction between NAH and WAH, but I think this wouldn’t constitute an example of NAH true, WAH false. Wait, before I elaborate further—was the second sentence supposed to contain such an example?
Quoted:

> It’s also OK if the AI doesn’t maximize diamonds, and instead just makes a whole lot of diamonds.[1]

[1]: I think that pure diamond maximizers are anti-natural, and at least not the first kind of success story we should try to tell. Furthermore, the analogous version for an aligned AI seems to be “an AI which really helps people, among other goals, and is not a perfect human-values maximizer (whatever that might mean).”

For more intuitions for why I think pure x-maximization is anti-natural, see this subsection of Inner and outer alignment decompose one hard problem into two extremely hard problems.
Thanks a lot for the comment and correction :)

I updated “diamond maximization problem” to “diamond alignment problem”.
I didn’t understand your proposal as involving surgically inserting a “diamonds are good” drive; rather, I understood it as systematically rewarding the agent for acquiring diamonds so that a diamond shard forms organically. I have also edited that sentence.
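As a contrast to the surgical picture, here is an equally minimal sketch of that route, again my own illustration rather than anything from the thread: a made-up toy setting where the only training signal is a reward for acquiring diamonds, so any diamond-seeking tendency (“shard”) has to form organically from that reinforcement.

```python
import math
import random

# Toy stand-in: the agent either grabs a diamond or wanders.
ACTIONS = ["grab_diamond", "wander"]

def reward(action: str) -> float:
    """The only signal: systematically reinforce diamond acquisition."""
    return 1.0 if action == "grab_diamond" else 0.0

# Tabular "policy": a preference score per action, sampled through a softmax.
prefs = {a: 0.0 for a in ACTIONS}

def sample_action() -> str:
    weights = [math.exp(prefs[a]) for a in ACTIONS]
    return random.choices(ACTIONS, weights=weights)[0]

learning_rate = 0.1
for _ in range(1000):
    action = sample_action()
    # Crude reinforcement step: strengthen whichever action was taken,
    # in proportion to the reward it earned.
    prefs[action] += learning_rate * reward(action)

print(prefs)  # "grab_diamond" ends up strongly preferred; nothing was inserted by hand
```

Nothing here references the network’s internals or a definition of “diamond”; whatever decision-influence emerges is a product of the reward history.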
I am not sure I get your nitpick: “Just as you can deny that Newtonian mechanics is true, without denying that heavy objects attract each other” was supposed to be an example of “the specific theory is wrong, but the general phenomenon it tries to describe exists”, in the same way that I think natural abstractions exist but (my flawed understanding of) Wentworth’s theory of natural abstractions is wrong. It was not supposed to be an example of a natural abstraction itself.