I found this very helpful, thanks! I think this may be what Yudkowsky was getting at when he brought up adversarial examples here.
Adversarial examples are like adversarial Goodhart. But an AI optimizing the universe for its imperfect understanding of the good is instead like extremal Goodhart. So, while adversarial examples show that cases of dramatic non-overlap between human and ML concepts exist, it may be that you need an adversarial process to find them with non-negligible probability. In which case we are fine.
This optimistic conjecture could be tested by looking at what image *maximally* triggers an ML classifier. Does the perfect cat, the image the classifier scores as most cat-like, actually look like a cat to us humans? If so, then by analogy the perfect utopia according to ML would also be pretty good. If not...
Perhaps this paper answers my question in the negative; I don't know enough ML to be sure. Thoughts?
> If you want to visualize features, you might just optimize an image to make neurons fire. Unfortunately, this doesn’t really work. Instead, you end up with a kind of neural network optical illusion — an image full of noise and nonsensical high-frequency patterns that the network responds strongly to.
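For concreteness, the naive version of the experiment I have in mind would look roughly like this. It's only a sketch under my own assumptions (a torchvision ResNet-50 and ImageNet's "tabby cat" class, index 281, chosen purely for illustration; the paper itself isn't specifying any of this):

```python
# Naive activation maximization: gradient ascent on the input pixels to
# maximize a classifier's "cat" logit. Sketch only; preprocessing details
# (input normalization, clamping to valid pixel ranges) are omitted.
import torch
import torchvision.models as models

model = models.resnet50(weights=models.ResNet50_Weights.DEFAULT).eval()
for p in model.parameters():
    p.requires_grad_(False)  # only the image is optimized, not the network

TABBY_CAT = 281  # ImageNet class index for "tabby cat" (assumed target class)
img = torch.randn(1, 3, 224, 224, requires_grad=True)  # start from noise
optimizer = torch.optim.Adam([img], lr=0.05)

for step in range(256):
    optimizer.zero_grad()
    logit = model(img)[0, TABBY_CAT]
    (-logit).backward()   # ascend the cat logit by descending its negation
    optimizer.step()

# `img` now holds the classifier's "most cat-like" input under this procedure.
```

Per the quoted passage, the raw output of this loop tends to be high-frequency noise rather than anything recognizably cat-like, rather than a satisfying "perfect cat."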