Pick a goal, and it’s easy to say what achieving it requires. But pick a human, and it’s not easy to say what their goal is.
Is my goal to survive? And yet I take plenty of risky actions, like driving, that trade survival off against other things. Even worse, I deliberately undergo transformative experiences (e.g. moving to a different city and making a bunch of new friends) that in some sense “make me a different person.” And worse still, sometimes I’m irrational or make mistakes, but under different interpretations of my behavior, different things count as irrational. If you interpret me as really wanting to survive, driving is an irrational thing I do because it’s common in my culture and I don’t have a good intuitive feel for statistics. If you interpret me a different way, maybe my intuitive feeling gets interpreted as more rational, but my goal changes from survival to something more complicated.
More complicated, yes, but I assume the question is whether superintelligent AIs can understand what you want “overall” at least as well as other humans can. And here I would agree with ozziegooen: the answer seems to be yes, even if they otherwise tend to reason about things differently than we do. Because there seems to be a fact of the matter about what you want overall, even if it is not easy to predict. But predicting it is not obviously inhibited by a tendency to think in different terms (a different “ontology”). Is the worry perhaps that the AI finds the concept of “what the human wants overall” unnatural, and so is unlikely to optimize for it?
“It sure seems like there’s a fact of the matter” is not a very forceful argument to me, especially in light of results like the impossibility of uniquely fitting a rationality model and a utility function to human behavior.
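As a minimal sketch of that impossibility (the Boltzmann choice model and the utilities below are illustrative assumptions, not anything from this discussion): the very same behavior is produced by a fairly rational agent that values u and by an equally anti-rational agent that values -u.

```python
import numpy as np

def boltzmann_policy(utilities, beta):
    """Choice probabilities P(a) proportional to exp(beta * U(a)).
    beta > 0: noisily rational; beta < 0: actively anti-rational."""
    logits = beta * np.asarray(utilities, dtype=float)
    logits -= logits.max()  # subtract the max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

# Hypothetical utilities over three options, e.g. {drive, take the train, stay home}.
u = np.array([1.0, 2.0, 0.5])

# Interpretation 1: the agent values u and is fairly rational.
p1 = boltzmann_policy(u, beta=3.0)

# Interpretation 2: the agent values -u and is equally anti-rational.
p2 = boltzmann_policy(-u, beta=-3.0)

print(np.allclose(p1, p2))  # True: identical behavior, opposite imputed goals
```

No amount of additional behavioral data distinguishes these two (rationality, utility) pairs; only an extra assumption can.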
If there were no fact of the matter about what you want overall, there would be no fact of the matter about whether an AI is aligned with you or not, which would mean there is no alignment problem.
The referenced post seems to apply specifically to IRL, which is based purely on observed behavior and doesn’t take information about the nature of the agent into account. (E.g. the fact that humans evolved through natural selection tells us a lot about what they probably want, and information about their brains could tell us how intelligent they are.) It’s also only an epistemic point about the problem of externally inferring values, not a claim that those values don’t exist.
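To make that concrete, here is a toy continuation of the sketch above (all numbers are assumptions chosen only for illustration): the two behaviorally identical hypotheses get equal likelihood from the data, and only a prior about the nature of the agent, e.g. “creatures shaped by natural selection are at least roughly rational,” separates them.

```python
# Continuing the toy example: H1 = (values u, beta = +3), H2 = (values -u, beta = -3).
# Both fit any observed choices equally well, so their likelihoods are equal.
likelihood = {"H1": 0.5, "H2": 0.5}

# Assumed prior encoding "evolved agents are roughly rational, not anti-rational".
prior = {"H1": 0.99, "H2": 0.01}

unnormalized = {h: likelihood[h] * prior[h] for h in likelihood}
total = sum(unnormalized.values())
posterior = {h: p / total for h, p in unnormalized.items()}

print(posterior)  # H1 dominates: the prior, not the behavior, picks the interpretation
```

The prior does all of the work here, and the prior is itself an interpretive choice.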
See my sequence “Reducing Goodhart” for what I (or at least the me from a few years ago) think the impact on the alignment problem is.
“the fact that humans evolved through natural selection tells us a lot about what they probably want”

Sure. But only if you already know what evolved creatures tend to want. I.e., once you have already made interpretive choices in one case, you can get some information on how well they hang together with other cases.