I think there’s a subtle confusion here between two different claims:
1. Human values evolved as a natural abstraction of some territory.
2. Humans’ notion of “human values” is a natural abstraction of humans’ actual values.
It sounds like your comment is responding to the former, while I’m claiming the latter.
A key distinction here is between humans’ actual values and humans’ model/notion of their own values. Humans’ actual values are the pile of heuristics inherited from evolution. But humans also have a model of their values, and that model is not the same as the underlying values. The phrase “human values” necessarily points to the model, because that’s how words work: they point to models. My claim is that the model is a natural abstraction of the actual values, not that the actual values are a natural abstraction of anything.
This is closely related to this section from the OP:
> Human values are basically a bunch of randomly-generated heuristics which proved useful for genetic fitness; why would they be a “natural” abstraction? But remember, the same can be said of trees. Trees are a complicated pile of organic spaghetti code, but “tree” is still a natural abstraction, because the concept summarizes all the information from that organic spaghetti pile which is relevant to things far away. In particular, it summarizes anything about one tree which is relevant to far-away trees.
Roughly speaking, the concept of “human values” summarizes anything about the values of one human which is relevant to the values of far-away humans.
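To make “summarizes anything relevant to far-away instances” a bit more concrete, here is a minimal toy sketch (my own illustration, not from the OP; all names and numbers are made up): each simulated “human” holds a noisy copy of a shared latent value vector, and the low-dimensional group mean turns out to carry essentially all the information about the nearby group that is useful for predicting a far-away human’s values.

```python
# Toy sketch, for illustration only: many simulated "humans" each hold a noisy
# copy of a shared latent value vector. The low-dimensional summary of one
# group (its mean) captures roughly all the information in that group that is
# relevant for predicting a far-away human's values -- a crude stand-in for
# "natural abstraction = summary of the information relevant far away".
import numpy as np

rng = np.random.default_rng(0)

n_dims = 50                              # dimensionality of the toy "value vector"
latent_values = rng.normal(size=n_dims)  # shared latent "human values"
noise_scale = 0.3                        # idiosyncratic variation per person

def sample_humans(n_humans):
    """Each human = shared latent values + independent personal noise."""
    return latent_values + noise_scale * rng.normal(size=(n_humans, n_dims))

nearby_group = sample_humans(1000)       # humans we can observe up close
far_away_human = sample_humans(1)[0]     # a human "far away"

# Full detail about the nearby group is 1000 x 50 numbers;
# the abstraction is just the 50-number group mean.
abstraction = nearby_group.mean(axis=0)

# Prediction error for the far-away human using the abstraction alone...
err_abstraction = np.abs(far_away_human - abstraction).mean()
# ...versus using one fully-detailed nearby human as the predictor.
err_single = np.abs(far_away_human - nearby_group[0]).mean()

print(f"error using the group-mean abstraction: {err_abstraction:.3f}")
print(f"error using one detailed nearby human:  {err_single:.3f}")
```

In this toy setup the group mean typically predicts the far-away human at least as well as any single fully-detailed individual, which is the (very rough) sense in which the abstraction keeps the information that matters at a distance and throws away the rest.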
Does that make sense?