Jai comments on Categorizing failures as “outer” or “inner” misalignment is often confused

Jai 9 Jan 2023 3:56 UTC
1 point
0
AF
The second scenario omits the details about continuing to create and submit pull requests after takeover, instead just referring to human farms. Since it doesn’t explicitly say that it’s still optimizing for the original objective criteria and instead just refers to world domination, it appears to be inner misalignment (e.g. no longer aligned with the original optimizer). Did the original posing of this question specify that scenario 2 still maximizes pull requests after world domination?
- Rohin Shah 9 Jan 2023 10:50 UTC
  2 points
  0
  Parent
  The intent was that in scenario 2 the AI constructs the human farms in order to to have the humans continually merging pull requests (i.e. yes, it still maximizes pull requests). I don’t think I explicitly said this—I probably gave the same text as in this post—but I expect people to have made this inference (because (a) otherwise it’s not clear why the AI would build human farms and (b) when I showed the inconsistency to the survey-takers, to my knowledge none of them said anything along the lines of “oh I was assuming that in scenario 2 the AI was no longer maximizing pull requests”).