Note that the examples in the OP are from an adversarial generative network. If its notion of “tree” were just “green things”, the adversary should be quite capable of exploiting that.
You will have successfully narrowed human values down to within the range of things that are strongly correlated with human values in the training environment. If you take this signal and apply enough optimization pressure, you are going to get the equivalent of a universe tiled with tiny smiley faces.
The whole point of the “natural abstractions” section of the OP is that I do not think this will actually happen. Off-distribution behavior is definitely an issue for the “proxy problems” section of the post, but I do not expect it to be an issue for identifying natural abstractions.
Note that the examples in the OP are from an adversarial generative network. If its notion of “tree” were just “green things”, the adversary should be quite capable of exploiting that.
In order for the network to produce good pictures, the concept of “tree” must be hidden in there somewhere, but it could be hidden in a complicated and indirect manor. I am questioning whether the particular single node selected by the researchers encodes the concept of “tree” or “green thing”.
Note that the examples in the OP are from an adversarial generative network. If its notion of “tree” were just “green things”, the adversary should be quite capable of exploiting that.
The whole point of the “natural abstractions” section of the OP is that I do not think this will actually happen. Off-distribution behavior is definitely an issue for the “proxy problems” section of the post, but I do not expect it to be an issue for identifying natural abstractions.
In order for the network to produce good pictures, the concept of “tree” must be hidden in there somewhere, but it could be hidden in a complicated and indirect manor. I am questioning whether the particular single node selected by the researchers encodes the concept of “tree” or “green thing”.
Ah, I see. You’re saying that the embedding might not actually be simple. Yeah, that’s plausible.