Because we don’t know when a neural network runs into the mesa-optimization problem, are we prone to adversarial attacks?
The example with the red doors is a neat one: there, as the human programmers we thought the algorithm was learning to reach red doors, but maybe all it was doing was learning to distinguish red from everything else.
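To make that concrete, here is a tiny toy sketch of the kind of thing I mean (my own example, not from the post; the noise level and the logistic-regression setup are just assumptions to make the point visible): if redness and door-ness are perfectly correlated during training but the colour signal is cleaner than the door signal, the learned model can end up keying almost entirely on colour, and then go for a red wall instead of a blue door once deployment breaks the correlation.

```python
# Toy sketch (my own, not from the post): "red" and "door" coincide in
# training, but the colour feature is cleaner than the noisy door feature,
# so the learner ends up keying mostly on colour, i.e. a proxy objective.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500

is_door = rng.integers(0, 2, n)      # ground truth: is this the target door?
is_red = is_door.copy()              # in training, every door is red and only doors are red
door_obs = np.where(rng.random(n) < 0.8, is_door, 1 - is_door)  # noisy "door detector"

X_train = np.column_stack([is_red, door_obs])
clf = LogisticRegression().fit(X_train, is_door)

# Deployment breaks the correlation: a red wall and a blue door.
X_test = np.array([[1, 0],   # red, but not a door
                   [0, 1]])  # a door, but not red
print(clf.predict(X_test))   # typically [1 0]: it chases red things, not doors
```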
Also, isn’t every neural network we train today performing some sort of mesa-optimization?