My expectation is that the Natural Abstractions Hypothesis probably works out as long as we don’t try to include values/ethics/morality in the mix, so I am more optimistic about the convergence of non-moral abstractions.
This is important because, while it wouldn’t let us automatically solve the alignment problem, it would make it much easier to change a model’s goals.
Why would norms be special here?
The question of what norms to adopt does not appear to be at stake with the NAH, but arguably the structure of norms is—the concepts we use to express norms and constrain the space of possible norms. NAH, if true, should be able to pick out the menu of norms to choose from, say, but then it’s a separate question of which norms to order off that menu.
The main point I am making here is that my weakly held belief is that the Natural Abstractions Hypothesis probably holds, while allowing for cases where it does in fact fail, as opposed to the alternative where it doesn’t hold at all.
Morality/ethics/values is my proposed failure case, since I don’t think even the weak version holds there; that is, I don’t think there is a finite set of valid abstractions of values/morals from the environment.
My expectation is that there is an infinite set of valid moralities, and that’s not consistent with even the weak version of the Natural Abstractions Hypothesis.