Ricardo Meneghin comments on A broad basin of attraction around human values?

Ricardo Meneghin 12 Apr 2022 11:53 UTC
1 point
When training a neural network, is there a broad basin of attraction around cat classifiers? Yes. There is a gigantuous number of functions that perfectly match the observed data and yet and discarded by the simplicity (and other) biases in our training algorithm in favor of well-behaving cat classifiers. Around any low kolmogorov complexity object there is an immense neighborhood of high complexity ones.
But it occurs to me that the overseer, or the system composing of overseer and corrigible AI, itself constitutes an agent with a distorted version of the overseer’s true or actual preferences
The only way I can see this making sense is if you again have a bias of simplicity for values, otherwise you are claiming that there is some value function that is more complex than the current value function of these agents and that it is privileged against the current one—but then, to arrive at this function you have to conjure information out of nowhere. If you took the information from other places, like averaging the values of many agents, then you actually want to align with the values of these many agents, or whatever else you used.
In fact it seems to be the case with your examples that you are favoring simplicity—if the agents were smarter they would realize their values were misbehaving. But that *is* looking for simpler values—if you through reasoning discovered some part of your values contradict others, you have just arrived at a simpler value function, since the contradicting parts needed extra specification, i.e. were noisy, and you weren’t smart enough to see that.