Hmm, got some complex thoughts here.
I am suspicious of NAH but for different reasons.
Concepts are contingent upon telos, i.e. they depend on what’s useful to the process creating the ontology. So it seems like this contingency should sink the project.
But reality is the same reality for everything embedded in it (or so it strongly seems), and most processes have some commonality in their telos. For example, most things want (“want” in the sense that they try to get the world into certain states, like the way a thermostat tries to make its sensor read a particular temperature) to survive (continue existing) because of selection effects (things that don’t want to survive quickly go away). So most processes model the world in ways that enable their survival.
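To make that thermostat sense of “want” concrete, here’s a minimal sketch; the function, setpoint, and action names are all invented for illustration:

```python
# A minimal sketch of "want" in the thermostat sense: a sensor -> action
# loop that steers the world toward a target state. The function, setpoint,
# and action names are invented for illustration.

def thermostat_step(sensor_temp: float, setpoint: float = 20.0) -> str:
    """Pick the action that moves the sensed temperature toward the setpoint."""
    if sensor_temp < setpoint - 0.5:
        return "heat_on"   # world too cold -> act to warm it
    if sensor_temp > setpoint + 0.5:
        return "heat_off"  # world too warm -> stop adding heat
    return "idle"          # close enough: the "wanted" state holds

# No inner experience required; "want" just labels this feedback loop.
print([thermostat_step(t) for t in (17.0, 19.8, 21.2)])
# -> ['heat_on', 'idle', 'heat_off']
```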
That shared survival pressure might be enough to get instrumental convergence towards common abstractions across a lot of processes. But I think it’s still unclear how much convergence is possible or likely. There’s an empirical question here we have yet to answer, because we don’t have enough different processes that aren’t indirectly influenced by human telos to draw robust conclusions.

So my current guess is that some weak version of NAH is true while a full, stronger version is not. There are some abstractions that many processes will develop because they’re commonly useful, but the effect may not be as strong as we’d hope, especially at the fringes or under heavy optimization pressure.
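As a toy illustration of that weak version (everything here, from the “world” to the two goals, is made up for the example): two processes with different teloi can both end up needing only the same low-dimensional summary of a big underlying state:

```python
# A toy sketch, with everything invented for the example: the "world" is a
# big vector of particle energies, and two processes with different goals
# both turn out to need only the same low-dimensional summary of it.

import random

def world_state(n=10_000):
    """A high-dimensional state: one energy value per particle."""
    return [random.gauss(0, 1) ** 2 for _ in range(n)]

def summary(state):
    """The shared abstraction: mean energy (a 'temperature')."""
    return sum(state) / len(state)

# Process A "wants" to stay cool; process B "wants" to forecast overheating.
def a_utility(state):
    return -summary(state)        # prefers low mean energy

def b_prediction(state):
    return summary(state) > 1.5   # forecasts via the same summary

# Different teloi, same abstraction: neither process needs the 10,000
# individual energies, only the one statistic. That's the weak-NAH picture.
s = world_state()
print(a_utility(s), b_prediction(s))
```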
Oh, that’s interesting. Yeah, I hadn’t thought about how instrumental convergence might play into this before. I’ll just note that the language-game critique is very similar to your telos frame, since Wittgenstein has this concept of “language as use”, where in most cases language is a tool for achieving a particular result within a particular language game. So it sounds like you’re actually suspicious of NAH for mostly the same reasons, but where you depart is in thinking that instrumental convergence limits the effects of this divergence.
Yeah, this seems reasonable to me. I’m not deeply familiar with Wittgenstein, so I’m going off your presentation, but my read is that this view pays too much attention to the fact that things are contingent and not enough to the fact that the structure of that contingency has a lot of commonality across cases. I’m not surprised there’s a similar idea in his work, though. Of course, this might be my own projection, since I’ve been pretty guilty of making this mistake myself: failing to appreciate the extent to which things add up to normality because of common features of how things in the world are constructed.
My expectation is that the Natural Abstractions Hypothesis probably works out as long as we don’t try to include values/ethics/morality in the mix, so I’m more optimistic about the convergence of non-moral abstractions.
This is important because, while it wouldn’t let us automatically solve the alignment problem, it would make it way easier to change a model’s goals.
Why would norms be special here?
The question of which norms to adopt doesn’t appear to be at stake with the NAH, but arguably the structure of norms is: the concepts we use to express norms, which constrain the space of possible norms. NAH, if true, should be able to pick out the menu of norms to choose from, but it’s then a separate question which norms to order off that menu.
The major point I’m making here is that my lightly held belief is that the Natural Abstractions Hypothesis probably holds, allowing for cases where it does in fact fail, rather than the alternative hypothesis that it doesn’t hold at all.
Morality/ethics/values is my proposed failure case, since I don’t think even the weak version holds there; that is, I don’t think there is a finite set of valid abstractions of values/morals from the environment.
My expectation is that there is an infinite set of valid moralities, and that’s not consistent with even the weak version of the Natural Abstractions Hypothesis.
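To sketch what I mean (the outcome features and weightings here are invented for the example): even if NAH hands every process the same menu of concepts, each weighting over those concepts is a distinct, internally coherent value system, and nothing about the environment picks one out:

```python
# A sketch with invented outcome features: fix a shared "menu" of concepts
# (fairness, total welfare), then note that every weighting over them is a
# distinct, internally consistent morality.

def make_morality(w: float):
    """Each w in [0, 1] defines a different coherent value system."""
    def score(outcome):
        fairness, total_welfare = outcome
        return w * fairness + (1 - w) * total_welfare
    return score

outcome_x = (0.9, 0.2)   # fair, but low total welfare
outcome_y = (0.1, 0.9)   # unfair, but high total welfare

# A continuum of moralities, ranking the same two outcomes differently:
for w in (0.0, 0.25, 0.5, 0.75, 1.0):
    prefers_x = make_morality(w)(outcome_x) > make_morality(w)(outcome_y)
    print(f"w={w}: prefers {'x' if prefers_x else 'y'}")
# -> y, y, x, x, x: no environmental fact selects a single w.
```

In menu terms: the abstractions fix what’s on the menu, but there are infinitely many consistent ways to order off it.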