I assume we all agree that the system can understand the human ontology, though? This is at least necessary for communicating and reasoning about humans, which LLMs can clearly already do to some extent.
Can we reason about a thermostat’s ontology? Only sort of. We can say things like “The thermostat represents the local temperature. It wants that temperature to be the same as the set point.” But the thermostat itself is only very loosely approximating that kind of behavior—imputing any sort of generalizability to it that it doesn’t actually have is an anthropomorphic fiction. And it’s blatantly a fiction, because there’s more than one way to do it—you can suppose the thermostat wants only its temperature sensor to read the set point, or that it wants the whole room, or the whole world, to be at that temperature; or that it’s “changing its mind” when it breaks, vs. that it would want to be repaired; etc.
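A minimal sketch of that underdetermination, with a made-up thermostat and two made-up “goal” stories (nothing here is a real system): within normal operation the stories agree about everything, and only counterfactuals the device never encounters would separate them.

```python
# A toy thermostat: its entire behavior is a function of the sensor reading alone.
def thermostat_action(sensor_temp, set_point):
    return "heat" if sensor_temp < set_point else "off"

# Two candidate stories about "what it wants", stated as satisfaction conditions.
def sensor_story(sensor_temp, room_temp, set_point):
    return sensor_temp >= set_point   # "it wants the sensor to read the set point"

def room_story(sensor_temp, room_temp, set_point):
    return room_temp >= set_point     # "it wants the whole room at the set point"

SET_POINT = 20.0

# Normal operation: the sensor tracks the room, so both stories fit the behavior equally well.
for sensor, room in [(18.0, 18.0), (22.0, 22.0)]:
    print(thermostat_action(sensor, SET_POINT),
          sensor_story(sensor, room, SET_POINT),
          room_story(sensor, room, SET_POINT))

# A counterfactual the device never handles (say, a heat lamp aimed at the sensor): the
# stories now disagree, and nothing in the mechanism says which one is "really" its goal.
print(sensor_story(21.0, 15.0, SET_POINT))  # True
print(room_story(21.0, 15.0, SET_POINT))    # False
```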
To the superintelligent AI, we are the thermostat. You cannot be aligned to humans purely by being smart, because finding “the human ontology” is an act of interpretation, of story-telling, not just a question of fact. Helping an AI narrow down how to interpret humans as moral patients requires giving it extra assumptions or meta-level processes. (Or as I might call it, “solving the alignment problem.”)
How can this be, if a smart AI can talk to humans intelligibly and predict their behavior and so forth, even without specifying any of my “extra assumptions”? Well, how can we interact with a thermostat in a way that it can “understand,” even without fixing any particular story about its desires? We understand how it works in our own way, and we take actions using our own understanding. Often our interactions fall in the domain of the normal functioning of the thermostat, under which several different possible stories about “what the thermostat wants” apply, and sometimes we think about such stories but mostly we don’t bother.
Your thermostat example seems rather to highlight a disanalogy: the concept of a goal doesn’t apply to the thermostat because there is apparently no fact of the matter about which counterfactual situations would satisfy such a “goal”. I think part of the reason is that the concept of a goal requires being applicable to counterfactual situations. But for humans there is such a fact of the matter: there are things that would be incompatible with, or required by, our goals, even though some or many other things may be neutral (neither incompatible nor required).
So I don’t think there are any “extra assumptions” needed. In fact, even if there were such extra assumptions, it’s hard to see how they could be relevant. (This is analogous to the ancient philosophical argument that God declaring murder to be good obviously wouldn’t make it good, so God declaring murder to be bad must be irrelevant to murder being bad.)
Pick a goal, and it’s easy to say what’s required. But pick a human, and it’s not easy to say what their goal is.
Is my goal to survive? And yet I take plenty of risky actions, like driving, that trade survival off against other things. Even worse, I deliberately undergo some transformative experiences (e.g. moving to a different city and making a bunch of new friends) that in some sense “make me a different person.” And worse still, sometimes I’m irrational or make mistakes, and under different interpretations of my behavior, different things count as irrational. If you interpret me as really wanting to survive, driving is an irrational thing I do because it’s common in my culture and I don’t have a good intuitive feel for statistics. If you interpret me a different way, maybe my intuitive feeling gets interpreted as more rational, but my goal changes from survival to something more complicated.
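A minimal sketch of that underdetermination, with made-up numbers: the same observed behavior (choosing to drive) falls out of a model that values survival enormously but misjudges the risk, and also out of a model that judges the risk well but has a more complicated goal.

```python
# A toy choice ("drive or stay home") scored by expected utility; all numbers are made up.
def choice(p_crash, u_crash, u_arrive, u_stay_home):
    eu_drive = p_crash * u_crash + (1 - p_crash) * u_arrive
    return "drive" if eu_drive > u_stay_home else "stay home"

# Interpretation 1: cares overwhelmingly about survival, but badly underestimates the risk.
survival_but_bad_intuitions = choice(p_crash=1e-9, u_crash=-1_000_000.0,
                                     u_arrive=1.0, u_stay_home=0.0)

# Interpretation 2: estimates the risk roughly correctly, but weights survival less heavily
# against the value of getting where they're going.
calibrated_but_complicated_goal = choice(p_crash=1e-4, u_crash=-100.0,
                                         u_arrive=1.0, u_stay_home=0.0)

print(survival_but_bad_intuitions, calibrated_but_complicated_goal)  # both: "drive"
```

Only the action is observed; the split between “beliefs” and “values” is the interpretive part.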
More complicated, yes, but I assume the question is whether superintelligent AIs can understand what you want “overall” at least as well as other humans can. And here I would agree with ozziegooen: the answer seems to be yes—even if they otherwise tend to reason about things differently than we do. Because there seems to be a fact of the matter about what you want overall, even if it is not easy to predict. But predicting it is not obviously inhibited by a tendency to think in different terms (a different “ontology”). Is the worry perhaps that the AI finds the concept of “what the human wants overall” unnatural, and so is unlikely to optimize for it?
“It sure seems like there’s a fact of the matter” is not a very forceful argument to me, especially in light of things like it being impossible to uniquely fit a rationality model and utility function to human behavior.

If there were no fact of the matter about what you want overall, there would be no fact of the matter about whether an AI is aligned with you or not, which would mean there is no alignment problem.
The referenced post seems to apply specifically to IRL (inverse reinforcement learning), which is purely behaviorist and doesn’t take information about the nature of the agent into account. (E.g. the fact that humans evolved by natural selection tells us a lot about what they probably want, and information about their brain could tell us how intelligent they are.) It’s also only an epistemic point about the problem of externally inferring values, not a claim that those values don’t exist.
See my sequence “Reducing Goodhart” for what I (or at least the me of a few years ago) think the impact is on the alignment problem.
“the fact that humans evolved by natural selection tells us a lot about what they probably want”

Sure. But only if you already know what evolved creatures tend to want. That is, once you have already made interpretive choices in one case, you can get some information about how well they hang together with other cases.