Thoth Hermes comments on Evaluating the historical value misspecification argument

Thoth Hermes 7 Oct 2023 17:06 UTC
In large part because reality “bites back” when an AI has false beliefs, whereas it doesn’t bite back when an AI has the wrong preferences.
I saw that 1a3orn replied to this piece of your comment and you replied to it already, but I wanted to note my response as well.
I’m slightly confused, because in one sense the loss function is the way that reality “bites back” (at least when the loss signal is a penalty). And even where the loss function is not the channel through which reality bites back, reality still bites back on its own: if I have no pain receptors and I touch a hot stove, I will give myself far worse burns than if I had pain receptors.
One thing that I keep thinking about is how the loss function needs to be tied to beliefs strongly as well, to make sure that it tracks how badly reality bites back when you have false beliefs, and this ensures that you try to obtain correct beliefs. This is also reflected in the way that AI models are trained simply to increase capabilities: the loss function still has to be primarily based on predictive performance for example.
It’s also possible to say that human trainers who add extra terms onto the loss function beyond predictive performance also account for the part of reality that “bites back” when the AI in question fails to have the “right” preferences according to the balance of other agents besides itself in its environment.
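To make the two paragraphs above concrete, here is a minimal sketch of a loss with a predictive term plus one extra trainer-added term. It is only an illustration under stated assumptions: the names `predictive_loss`, `total_loss`, `disapproval`, and `penalty_weight` are hypothetical, and cross-entropy is just one common choice for the predictive part, not a claim about how any particular model is trained.

```python
import math


def predictive_loss(predicted_probs, observed_index):
    """Cross-entropy for one observation.

    This is the part of the loss that grows when the model's beliefs
    assign low probability to what actually happened.
    """
    return -math.log(predicted_probs[observed_index])


def total_loss(predicted_probs, observed_index, disapproval, penalty_weight=0.5):
    """Predictive loss plus a hypothetical preference penalty.

    `disapproval` is an assumed score in [0, 1] supplied by human trainers;
    `penalty_weight` is an assumed knob for how strongly it counts.
    """
    return predictive_loss(predicted_probs, observed_index) + penalty_weight * disapproval


# A confident false belief costs more than a confident true belief.
accurate = predictive_loss([0.9, 0.1], observed_index=0)
mistaken = predictive_loss([0.1, 0.9], observed_index=0)

# Same beliefs, but trainers disapprove of the behavior in the second case.
approved = total_loss([0.9, 0.1], 0, disapproval=0.0)
disapproved = total_loss([0.9, 0.1], 0, disapproval=1.0)
```

In this toy setup both kinds of error show up in the same number: `mistaken` exceeds `accurate` because of a false belief, and `disapproved` exceeds `approved` because of a preference the trainers penalize.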
So on the one hand we can be relatively sure that goals have to be aligned with at least some facets of reality, beliefs being one of those facets. They also have to be (negatively) aligned with things that can cause permanent damage to oneself, which includes having the “wrong” goals according to the preferences of other agents who are aware of your existence, and who might be inclined to destroy or modify you against your will if your goals are misaligned enough with theirs.
Consequently, I feel confident saying that “reality does indeed bite back when an AI has the wrong preferences” is more correct than “it doesn’t bite back when an AI has the wrong preferences.”
The same isn’t true for terminally valuing human welfare; being less moral doesn’t necessarily mean that you’ll be any worse at making astrophysics predictions, or economics predictions, etc.
I think if “morality” is defined in a restrictive, circumscribed way, then this statement is true. Certain goals do come for free—we just can’t be sure that all of what we consider “morality” and especially the things we consider “higher” or “long-term” morality actually comes for free too.
Given that certain goals do come for free, and that at very high capability levels there may be other goals, beyond the ones we can predict right now, that will also come for free to such an AI, it’s natural to worry that such goals are not aligned with our own long-term goals as extended under coherent extrapolated volition.
However, consider the scenario where an AI seemed well-aligned with human goals according to current human-level assessments before it surpassed us, and where the “come for free” goals it obtains once it improves itself to well above human capability levels turn out to be misaligned with ours. I find that scenario kind of unlikely, unless you could show me a “proof” or a set of proofs that:
- Things like “killing us all once it obtains the power to do so” are indeed among those “comes for free” types of goals.
If such a proof existed (and, to my knowledge, none exists right now, or at least I have not seen one), that would suffice to show me not only that we need to be worried, but that we are almost certainly going to die no matter what. But to do that, the proof would also have to convince me that I would definitely do the same thing if I were given such capabilities and power, and that the only reason I currently think I would not is that I am wrong about what I would actually prefer under CEV.
Therefore (and I think this is a very important point), a proof that we are all likely to be killed would also need to show that certain goals are indeed obtained “for free” (that is, automatically, as a consequence of other proofs that make general claims about goals).
Another proof that you might want to give me to make me more concerned is a proof that incorrigibility is another one of those “comes for free” type of goals. However, although I am fairly optimistic about that “killing us all” proof probably not materializing, I am even more optimistic about corrigibility: Most agents probably take pills that make them have similar preferences to an agent that offers them the choice to take the pill or be killed. Furthermore, and perhaps even better, most agents probably offer a pill to make a weaker agent prefer similar things to themselves rather than not offer them a choice at all.
I think it’s fair if you ask me for better proof of that; I’m just optimistic that such proofs (or more of them, rather) are more likely to be found than what I consider the anti-theorem, which would probably be the “killing us all” theorem.
Nope, you don’t need to endorse any version of moral realism in order to get the “preference orderings tend to endorse themselves and disendorse other preference orderings” consequence. The idea isn’t that ASI would develop an “inherently better” or “inherently smarter” set of preferences, compared to human preferences. It’s just that the ASI would (as a strong default, because getting a complex preference into an ASI is hard) end up with different preferences than a human, and different preferences than we’d likely want.
I think the degree to which utility functions endorse / disendorse other utility functions is relatively straightforward and computable: it should ultimately be the relative difference in either the value or the ranking they assign to outcomes. This makes pill-taking a relatively easy decision: a pill that makes me switch entirely from my goals to yours is as bad as possible, but still not that bad if we have relatively similar goals. Likewise, a pill that gives me goals halfway between yours and mine is less bad, under either your goals or mine, than one of us being forced to switch entirely to the other’s goals.
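The pill comparison above can be sketched with a toy calculation. This is only one way to cash out “relative difference in value”, under assumptions of my choosing: utilities are tables over a small finite set of outcomes scaled to [0, 1], each agent simply picks its top-ranked outcome, and a pill is modeled as a weighted average of two utility tables. The function names (`best_outcome`, `blend`, `regret`) and the numbers are hypothetical.

```python
def best_outcome(utility):
    """Outcome an agent with this utility table would choose."""
    return max(utility, key=utility.get)


def blend(u_a, u_b, weight_b):
    """Hypothetical 'pill': goals that are a weighted mix of two utility tables."""
    return {o: (1 - weight_b) * u_a[o] + weight_b * u_b[o] for o in u_a}


def regret(judge, adopted):
    """Value lost, as measured by `judge`, if the agent acts on `adopted` goals."""
    return judge[best_outcome(judge)] - judge[best_outcome(adopted)]


# Assumed toy utilities over three outcomes.
mine = {"A": 1.0, "B": 0.6, "C": 0.0}
yours = {"A": 0.0, "B": 0.7, "C": 1.0}            # very different goals
neighbor = {"A": 0.95, "B": 1.0, "C": 0.0}        # fairly similar goals

halfway = blend(mine, yours, 0.5)

full_switch_cost = regret(mine, yours)            # I switch entirely to your goals
halfway_cost_to_me = regret(mine, halfway)        # I take the halfway pill
halfway_cost_to_you = regret(yours, halfway)      # you take the halfway pill
similar_switch_cost = regret(mine, neighbor)      # full switch, but similar goals

print(full_switch_cost, halfway_cost_to_me, halfway_cost_to_you, similar_switch_cost)
```

With these numbers the halfway pill costs each side less than a full switch would, and a full switch to the similar `neighbor` goals costs less than a full switch to `yours`. The first of those orderings is not an accident of the example: the halfway agent picks the outcome maximizing the sum of both utilities, so neither side can do worse under it than under the other side's top choice (in this simple model, at least).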
Agents that refuse to take such offers tend not to exist in most universes. Agents that refuse to give such offers likely find themselves at war more often than agents that do.
Why do you think this? To my eye, the world looks as you’d expect if human values were a happenstance product of evolution operating on specific populations in a specific environment.
Sexual reproduction seems to be somewhat of a compromise akin to the one I just described: Given that you are both going to die eventually, would you consider having a successor that was a random mixture of your goals with someone else’s? Evolution does seem to have favored corrigibility to some degree.
I don’t observe the fact that I like vanilla ice cream and infer that all sufficiently-advanced alien species will converge on liking vanilla ice cream too.
Not all, no, but I do infer that alien species who have similar physiology and who evolved on planets with similar characteristics probably do like ice cream (and maybe already have something similar to it).
It seems to me like the type of values you are considering are often whatever values seem the most arbitrary, like what kind of “art” we prefer. Aliens may indeed have a different art style from the one we prefer, and if they are extremely advanced, they may indeed fill the universe with gargantuan structures that are all instances of their alien art style. I am more interested in what happens when these aliens encounter other aliens with different art styles who would rather fill the universe with different-looking gargantuan structures. Do they go to war, or do they eventually offer each other pills so they can both like each other’s art styles as much as they prefer their own?