In my previous post, “Are extrapolation-based AIs alignable?”, I argued that an AI trained only to extrapolate some dataset (like an LLM) can’t really be aligned, because it wouldn’t know what information can be shared when and with whom.
Mostly because this is not, in fact, the task of alignment.
A better formulation of the alignment goal is this:
When thinking about AIs that are trained on some dataset and learn to extrapolate it, like the current crop of LLMs, I asked myself: can such an AI be aligned purely by choosing an appropriate dataset to train on? In other words, does there exist any dataset such that generating extrapolations from it leads to good outcomes from the perspective of the actor controlling the AI?
This is our actual task for alignment.
The problem I see is that our values are defined in a stable way only inside the distribution, i.e. for situations similar to those we have already experienced.
Outside of it there may be many radically different extrapolations that are consistent both with themselves and with our values inside the distribution. And that is a problem not with the AI, but with the values themselves.
For example, there is no correct answer to what a human is, i.e. how much we can “improve” a human before it stops being human. We can choose different answers, and they will all be consistent with our pre-singularity concept of a human and will not contradict already established values.
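To make the “many consistent extrapolations” point concrete, here is a minimal numerical sketch (a toy analogy rather than a model of values; the dataset, degrees, and ranges are illustrative assumptions): several curve fits that agree closely on the same in-distribution data can disagree wildly once queried far outside that range.

```python
import numpy as np

rng = np.random.default_rng(0)

# "In-distribution" data: x in [0, 1], underlying relation y = sin(2*pi*x).
x_train = rng.uniform(0.0, 1.0, size=40)
y_train = np.sin(2 * np.pi * x_train) + 0.01 * rng.normal(size=x_train.size)

# Fit polynomials of several degrees; all of them match the training range well.
fits = {deg: np.polynomial.Polynomial.fit(x_train, y_train, deg) for deg in (5, 9, 13)}

x_in = np.linspace(0.0, 1.0, 5)   # inside the training distribution
x_out = np.linspace(2.0, 3.0, 5)  # far outside it

for deg, poly in fits.items():
    in_err = np.max(np.abs(poly(x_in) - np.sin(2 * np.pi * x_in)))
    print(f"degree {deg:2d}: max in-distribution error ~ {in_err:.3f}, "
          f"predictions on [2, 3]: {np.round(poly(x_out), 1)}")

# Typically all fits track the data closely inside [0, 1], while their
# out-of-distribution predictions differ from each other by orders of magnitude.
```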
Yeah. Or rather, we do have one possible answer—let the person themselves figure out by what process they want to be extrapolated, as steven0461 explained in this old thread—but that answer isn’t very good, as it’s probably very sensitive to initial conditions, like which brand of coffee you happened to drink before you started self-extrapolating.
“Making decisions oneself” will also become a very vague concept when superconvincing AIs are running around.
This is actually a problem, but I do not believe there is a single answer to that question; in fact, I suspect there are infinitely many valid ways to answer it (once we consider multiverses).
And I think this sensitivity to initial conditions and assumptions is exactly what morality and values have. That is, one can freely change one’s assumptions, which leads to an inconsistent but complete morality.
The point is that your starting assumptions and conditions matter for where you eventually want to end up.
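Relatedly, the “sensitive to initial conditions” worry can be illustrated with a minimal toy sketch (a standard chaotic map used purely as an analogy; it does not model value extrapolation, and the constants are arbitrary): two trajectories that start almost identically end up essentially unrelated.

```python
# Iterate the logistic map from two starting points that differ by one part
# in a million, standing in for "which brand of coffee you happened to drink".
r = 3.9                          # parameter in the chaotic regime
x_a, x_b = 0.500000, 0.500001    # nearly identical initial conditions

for step in range(1, 41):
    x_a = r * x_a * (1 - x_a)
    x_b = r * x_b * (1 - x_b)
    if step % 10 == 0:
        print(f"step {step:2d}: a={x_a:.4f}  b={x_b:.4f}  |a-b|={abs(x_a - x_b):.4f}")

# The gap grows roughly exponentially: after a few dozen steps the two
# trajectories bear no resemblance to each other.
```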