A few years ago I thought about a problem which I think is the same thing you’re pointing to here—no perfect feedback, uncertainty and learning all the way up the meta-ladder, etc. My attempt at a solution was quite different.
The basic idea is to use a communication prior—a prior which says “someone is trying to communicate with you”.
With an idealized communication prior, our update is not P[Y|X], but instead P[Y|M], where (roughly) M = “X maximizes P[Y|M]” (except that we unroll the fixed point to make the initial condition of the iteration explicit). Interpretation: the “message sender” chooses the value of X which results in us assigning maximum probability to Y, and we update on this fact. If you’ve played Codenames, this leads to similar chains of logic: “well, ‘blue’ seems like a good hint for both sky+sapphire and sky+water, but if it were sky+water they would have said ‘weather’ or something like that instead, so it’s probably sky+sapphire...”. As with recursive quantilizers, the infinite meta-tower collapses into a fixed-point calculation, and there’s (hopefully) a basin of attraction.
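To make the fixed point concrete, here is a minimal toy sketch in Python. The hints, targets, starting probabilities, and softmax sharpness are all arbitrary illustration choices; it is just one way to unroll the sender-picks-the-best-hint / listener-updates-on-that loop.

```python
# A minimal toy sketch of the communication-prior fixed point, using a
# Codenames-style example. All hints, targets, probabilities, and the
# softmax "sharpness" below are made-up illustration choices.

import numpy as np

hints = ["blue", "weather"]               # possible messages X
targets = ["sky+sapphire", "sky+water"]   # possible meanings Y

# Literal starting point P0[Y|X]: rows = hints, columns = targets.
# "blue" fits both targets equally; "weather" strongly suggests sky+water.
P0 = np.array([
    [0.5, 0.5],   # blue
    [0.1, 0.9],   # weather
])

def iterate(P, sharpness=10.0, steps=10):
    """Unroll the fixed point: the sender picks the hint X that maximizes the
    listener's current P[Y|X] (softly, via a sharp softmax), and the listener
    updates on the assumption that the sender behaves that way."""
    for _ in range(steps):
        sender = np.exp(sharpness * P)                   # near-argmax over hints...
        sender /= sender.sum(axis=0, keepdims=True)      # ...separately for each target
        P = sender / sender.sum(axis=1, keepdims=True)   # Bayes over targets per hint,
                                                         # assuming a uniform prior on Y
    return P

print(iterate(P0))
# "blue" ends up pointing almost entirely at sky+sapphire: if the sender had
# meant sky+water, they would have said "weather" instead.
```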
To make this usable for alignment purposes we need a couple modifications.
First, obviously, humans are not perfectly rational and logically omniscient, so we have to replace “X maximizes P[Y|M]” with “<rough model of human> thinks X will produce high P[Y|M]”. The better the human-model, the broader the basin of attraction for the whole thing to work.
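A minimal sketch of that substitution in the same toy setup: the near-argmax sender gets replaced by a noisier model of the human, with a hypothetical `rationality` parameter standing in for how closely the modeled human tracks the ideal sender.

```python
# A sketch of replacing the ideal sender with a rough human model: the human
# only softly prefers hints that make the listener more confident in Y.
# The `rationality` knob is a hypothetical stand-in for model quality.

import numpy as np

def human_sender(P, rationality):
    """P[x, y] = listener's current probability of target y given hint x.
    Returns the modeled human's probability of sending hint x for target y."""
    weights = np.exp(rationality * P)     # rationality -> infinity recovers the argmax sender
    return weights / weights.sum(axis=0, keepdims=True)

P0 = np.array([[0.5, 0.5],    # "blue"
               [0.1, 0.9]])   # "weather"
print(human_sender(P0, rationality=1.0))    # noisy human: weak preference between hints
print(human_sender(P0, rationality=20.0))   # near-ideal sender: close to a hard argmax
```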
Second, we have to say what the “message” X from the human is, and what Y is. Y would be something like human values, and X would include things like training data and/or explicit models. In principle, we could get uncertainty and learning “at the outermost level” by having the system treat its own source code as a “message” from the human: the source code is, after all, something the human chose expecting that it would produce a good estimate of human values. On the other hand, if the source code contained an error (one that didn’t just mess up everything), the system could potentially recognize it as an error and do something else instead.
Finally, obviously the “initial condition” of the iteration would have to be chosen carefully—that’s basically just a good-enough world model and human-values-pointer. In a sense, we’re trying to formalize “do what I mean” enough that the AI can figure out what we mean.
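As a toy illustration of why the starting point matters (same hypothetical Codenames-style numbers and fixed-point loop as the first sketch, just with an initial model that misreads the hints):

```python
# If the initial condition already misreads which hint goes with which target,
# the fixed point locks that misreading in rather than correcting it.
# (Same toy fixed-point loop as the first sketch; numbers are hypothetical.)

import numpy as np

def iterate(P, sharpness=10.0, steps=10):
    for _ in range(steps):
        sender = np.exp(sharpness * P)
        sender /= sender.sum(axis=0, keepdims=True)      # sender picks a hint per target
        P = sender / sender.sum(axis=1, keepdims=True)   # listener updates per hint
    return P

# Rows = hints (blue, weather), columns = targets (sky+sapphire, sky+water).
bad_start = np.array([
    [0.2, 0.8],   # "blue" initially read as mostly about sky+water
    [0.3, 0.7],   # "weather" initially read as the better sapphire hint
])
print(iterate(bad_start))
# Settles on a self-consistent but wrong convention ("blue" = sky+water), so the
# initial world model has to land inside the basin of the intended interpretation.
```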
I like that these ideas can be turned into new learning paradigms relatively easily.
I think there’s obviously something like your proposal going on, but I feel like it’s the wrong place to start.
It’s important that the system realize it has to model human communication as an attempt to communicate something, which is what you’re doing here. It’s something utterly missing from my model as written.
However, I feel like starting from this point forces us to hard-code a particular model of communication, which means the system can never get beyond this. As you said:
First, obviously, humans are not perfectly rational and logically omniscient, so we have to replace “X maximizes P[Y|M]” with “<rough model of human> thinks X will produce high P[Y|M]”. The better the human-model, the broader the basin of attraction for the whole thing to work.
I would rather attack the problem of specifying what it could mean for a system to learn at all the meta levels in the first place, and then teach such a system about this kind of communication model as part of its broader education about how to avoid things like wireheading, human manipulation, treacherous turns, and so on.
Granted, you could overcome the hardwired-ness of the communication model if your “treat the source code as a communication, too” idea ended up allowing a reinterpretation of the basic communication model. That just seems very difficult.
All this being said, I’m glad to hear you were working on something similar. Your idea obviously starts to get at the “interpretable feedback” idea which I basically failed to make progress on in my proposal.
Yeah, I largely agree with this critique. The strategy relies heavily on the AI being able to move beyond the initial communication model, and we have essentially no theory to back that up.
Maybe I’ll write up a post on this tomorrow.
Still interested in your write-up, though!
It’s up.