As a bonus, I’m pretty sure this approach is robust to the sharp left turn. Once the AI becomes advanced enough that most of its capability gain comes from its own computations, as opposed to the SGD[1], it won’t want to change its mesa-objective. Indeed, it would do its best to avoid that outcome, up to hacking the training process and/or escaping it. If we goal-align it before this point, in a way that is robust to any ontology shifts (addressing which may be less difficult than it seems on the surface), it’ll stay aligned forever.
In particular, this way we’ll never have to deal with a lot of the really nasty stuff, like a superintelligence trying to foil our interpretability tools — a flawed mesa-objective is the ultimate precursor to that.
[1] Like ~all of a modern human’s capability edge over a caveman from 50k years ago comes from the conceptual, scientific, and educational cognitive machinery other humans have developed, not from evolution.