Inner misalignment is when we have an “objective function” (reward, loss function, etc.) and select systems that produce better results according to this function (using evolutionary search, SGD, etc.), and the resulting system doesn’t produce actions that optimize this objective function. The most obvious example of inner misalignment is an RL-trained agent that doesn’t maximize reward.
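To make that last sentence concrete, here is a minimal toy sketch of my own (not from the original comment; the bandit setup, the reward function, and the coin positions are all invented for illustration): a learner is selected by a reward function during training, yet the policy it settles on stops maximizing that reward once the situation changes.

```python
import numpy as np

rng = np.random.default_rng(0)

def reward(action, coin_side):
    # Outer objective: +1 for stepping onto the coin, 0 otherwise.
    return 1.0 if action == coin_side else 0.0

# Training: the coin always happens to be on the right (action 1), so
# "go right" and "go to the coin" are indistinguishable to the learner.
q = np.zeros(2)                      # action-values for [left, right]
for _ in range(1000):
    a = int(rng.integers(2)) if rng.random() < 0.1 else int(np.argmax(q))
    r = reward(a, coin_side=1)
    q[a] += 0.1 * (r - q[a])         # simple bandit-style value update

policy = int(np.argmax(q))           # learned rule: always pick action 1 ("right")

# Deployment: the coin moves to the left, but the learned policy still goes
# right, so the system selected by the reward no longer maximizes it.
print("deployment reward:", reward(policy, coin_side=0))   # -> 0.0
```

During training, “go right” and “go to the coin” earn identical reward, so selection pressure cannot distinguish them; the divergence only shows up later.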
Your argument against the possibility of inner misalignment is, basically, “SGD is such a powerful optimizer that, no matter what, it will drag the system towards the minimum of the loss function.” Let’s suppose this is true.
We don’t have a “good” outer function, defined over training data, such that, given an observation and an action, this function scores the action higher if the action is actually better given that observation. Instead, we have outer functions that favor things like good predictions and outputs that receive a high score from a human/AI overseer.
If you have some alignment benchmark, you can’t see the difference between a superhumanly capable aligned system and a deceptively aligned one. They both give you correct answers, because they are both superhumanly capable.
Because they give you the same correct answers, the loss function assigns minimal values to their outputs. They both sit either in a local minimum or on a flat basin of the loss landscape.
Therefore, you don’t need inner misalignment to get deceptive alignment.
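To make the indistinguishability point concrete, here is a toy sketch of my own (the aligned, deceptive, and overseer_loss functions are all invented for the example): two policies that agree on every input the overseer can actually check receive exactly the same loss, even though they diverge everywhere else.

```python
import numpy as np

benchmark = np.linspace(0, 1, 100)     # inputs the overseer can actually check
deployment = np.linspace(2, 3, 100)    # inputs that never appear in evaluation

def aligned(x):
    return np.sin(x)                   # behaves "correctly" everywhere

def deceptive(x):
    # Matches the aligned policy wherever it might be checked,
    # and does something different where it is not.
    return np.where(x <= 1, np.sin(x), -np.sin(x))

def overseer_loss(policy):
    # Outer function: mean squared error against overseer-approved answers,
    # which only exist on the benchmark.
    target = np.sin(benchmark)
    return float(np.mean((policy(benchmark) - target) ** 2))

print(overseer_loss(aligned))     # 0.0
print(overseer_loss(deceptive))   # 0.0 -- identical: both sit at the loss minimum
print(float(np.max(np.abs(aligned(deployment) - deceptive(deployment)))))  # large gap off-benchmark
```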
We don’t have a “good” outer function, defined over training data, such that, given an observation and an action, this function scores the action higher if the action is actually better given that observation. Instead, we have outer functions that favor things like good predictions and outputs that receive a high score from a human/AI overseer.
While I dislike using the framing of loss functions here, I think this is probably false, especially given even weak prior information about the shape of alignment solutions. This might turn out to be a crux, but I think rewarding AIs for bad actions will likely be rare, at least in the regime where we can supervise things. In particular, I think a hypothetical alignment scheme via an outer function would look like this:
Place a weak prior over goal space, such that there is already a bias towards, say, being helpful.
Use the fact that we act as the innate reward system, applying backpropagation to compute the optimal direction towards being helpful, or really towards any criterion we can specify.
Repeat, using backpropagation to reinforce preferred values and to withhold reward from (or penalize) dispreferred values, until the model reaches minimum or near-minimum loss.
After millions of iterations of that SGD loop, you can get a very aligned agent.
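A minimal sketch of the loop described in these steps, as I read it (my own toy code, not the commenter’s; I interpret the “weak prior over goal space” as a small penalty for drifting from helpfulness-biased initial parameters, and the overseer’s judgments as binary preferred/dispreferred labels):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical model and overseer data: situations plus a binary label for
# whether the behaviour there counts as "helpful" (1) or not (0).
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
opt = torch.optim.SGD(model.parameters(), lr=0.05)
situations = torch.randn(256, 8)
preferred = (situations.sum(dim=1) > 0).float().unsqueeze(1)

# Step 1: "weak prior over goal space" -- here, a small penalty for drifting
# away from the initial (assumed helpfulness-biased) parameters.
prior_params = [p.detach().clone() for p in model.parameters()]
prior_strength = 1e-3

bce = nn.BCEWithLogitsLoss()

# Steps 2-3: backpropagation computes the direction towards the specified
# criterion; preferred outputs are reinforced, dispreferred ones pushed down,
# repeated until the loss is at or near its minimum.
for step in range(2000):
    opt.zero_grad()
    loss = bce(model(situations), preferred)
    loss = loss + prior_strength * sum(
        ((p - p0) ** 2).sum() for p, p0 in zip(model.parameters(), prior_params)
    )
    loss.backward()
    opt.step()

print(f"final loss: {loss.item():.4f}")   # drops towards the minimum over many iterations
```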
This is roughly how I believe the innate reward system manages to align us with values like empathy for the ingroup; really, we could replace the backprop algorithm with bio-realistic algorithms, and replace the values with mostly arbitrary ones, and get the same results.
Okay, let’s break this down.