I think it is reasonable as engineering practice to try to make a fully classically Bayesian model of what we think we know about the necessary inductive biases, or, perhaps more realistically, a model which only violates classical Bayesian definitions where necessary in order to represent what we want to represent.
This is because writing down the desired inductive biases as an explicit prior can help us better understand what’s going on.
It’s tempting to say that to understand how the brain learns is to understand how it treats feedback as evidence and updates on that evidence. Of course, there could certainly be other theoretical frames which are more productive. But at a deep level, if the learning works, it works because the feedback is evidence about the thing we want to learn, and because the process which updates on that feedback embodies (something like) a good prior telling us how to update on that evidence.
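To make the explicit-prior picture concrete, here is a minimal toy sketch (my own illustration, with made-up hypotheses and feedback, not anything taken from the discussion itself): an inductive bias written down as an explicit prior over a few hypotheses, with feedback treated as evidence and handled by an ordinary Bayesian update.

```python
# Toy illustration (hypothetical): an inductive bias written down as an
# explicit prior over hypotheses, with feedback treated as evidence.

# Three hypotheses about how often feedback says "good"; the prior encodes
# the inductive bias, here a preference for middling feedback rates.
hypotheses = {0.2: 0.25, 0.5: 0.5, 0.8: 0.25}  # {rate: prior probability}

def update(posterior, feedback_is_good):
    """Bayes-update the distribution over hypotheses on one feedback signal."""
    likelihoods = {
        rate: (rate if feedback_is_good else 1.0 - rate)
        for rate in posterior
    }
    unnormalized = {rate: posterior[rate] * likelihoods[rate] for rate in posterior}
    total = sum(unnormalized.values())
    return {rate: p / total for rate, p in unnormalized.items()}

posterior = dict(hypotheses)
for fb in [True, True, False, True]:  # a stream of feedback signals
    posterior = update(posterior, fb)

print(posterior)  # probability mass shifts toward hypotheses that explain the feedback
```

The point of writing it this way is that the prior is an explicit object we can inspect and argue about, rather than something implicit in the update dynamics.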
And if that framing is wrong somehow, it seems intuitive to me that the problem should be describable within that ontology. Compare how I think “utility function” is not a very good way to think about values: it is unclear what the function is a function of, since we have no commitment to a specific low-level description of the universe which would be appropriate as the input to a utility function. We can easily move beyond this by taking expected values as the representation of values/preferences, without worrying about what underlying utility function generates those expected values.
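As a hedged sketch of that contrast (my own hypothetical encoding, not a representation proposed in the text): a utility function demands a full low-level world description as its input, while an expected-value representation only has to assign numbers to propositions the agent can actually entertain.

```python
# Hypothetical sketch of the contrast (my own encoding, for illustration only).

# Utility-function style: presupposes some committed low-level state format.
def utility(world_state: dict) -> float:
    # Only meaningful once we have fixed what keys a "world state" contains.
    return float(world_state.get("paperclips", 0))

# Expected-value style: numbers attach directly to propositions/events,
# with no commitment to an underlying world-state representation.
expected_values = {
    "I keep my promise": 1.0,
    "I break my promise": -2.0,
    "the project succeeds": 3.0,
}

def prefer(event_a: str, event_b: str) -> str:
    """Compare two propositions by their expected values."""
    return event_a if expected_values[event_a] >= expected_values[event_b] else event_b

print(prefer("I keep my promise", "I break my promise"))
```

The design choice being illustrated is simply that preferences can be compared at the level of propositions without ever specifying the utility function over low-level worlds that would generate those numbers.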
(I do not take the above to be a knockdown argument against “committing to the specific division between outer and inner alignment steers you wrong”—I’m just saying things that seem true to me and plausibly relevant to the debate.)