Koen.Holtman comments on The Inner Alignment Problem

Koen.Holtman 18 Aug 2019 10:19 UTC
1 point
I believe you might have been thinking in your reply above about a subset of all possible base objective functions: functions that compute a single ‘pass/fail’ value at a natural end of the training run, e.g. ‘the car never crashed’ or ‘all parts of the floor have been swept’. I was thinking of incrementally scoring objective functions, that is, functions that sum the utility increments achieved over time. With such a function, you can compute the base objective function score achieved so far at any time during a run. Monitoring this score should allow you to detect many forms of non-alignment between the base objective and the mesa objective automatically.
As mentioned, I see this as a promising technique for risk mitigation; it is not supposed to be a watertight way to eliminate all risks. The technique considered only looks at the score achieved so far. It does not run models to extrapolate and score the long-term consequences of every action: this would indeed be difficult. While observed past good performance does not guarantee future good performance, an observation of past bad performance does give you a very useful safety signal.
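To make the monitoring idea concrete, here is a minimal Python sketch. It is only an illustration under my own assumptions: the names `ScoreMonitor`, `min_rate`, `warmup_steps` and `first_alarm_step` are hypothetical, and the alarm rule (average utility per step falling below a fixed threshold) is just one simple choice among many possible ones. It looks only at the score achieved so far and does not extrapolate future consequences.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class ScoreMonitor:
    """Tracks the running base objective score and flags bad past performance.

    Hypothetical illustration: the threshold rule below is an assumption,
    not a recommended or established alarm criterion.
    """
    min_rate: float          # lowest acceptable average utility per step
    warmup_steps: int = 10   # steps to observe before judging the run
    total: float = 0.0       # sum of utility increments so far
    steps: int = 0           # number of increments recorded so far

    def record(self, utility_increment: float) -> bool:
        """Add one utility increment; return True if the alarm fires."""
        self.total += utility_increment
        self.steps += 1
        if self.steps < self.warmup_steps:
            return False  # too little evidence to judge yet
        # Only the score achieved so far is inspected here.
        return self.total / self.steps < self.min_rate


def first_alarm_step(increments: Iterable[float],
                     min_rate: float,
                     warmup_steps: int = 10) -> Optional[int]:
    """Return the 1-based step at which the alarm first fires, or None."""
    monitor = ScoreMonitor(min_rate=min_rate, warmup_steps=warmup_steps)
    for step, increment in enumerate(increments, start=1):
        if monitor.record(increment):
            return step
    return None


if __name__ == "__main__":
    # An agent that scores well for 10 steps, then starts losing utility.
    run = [1.0] * 10 + [-1.0] * 20
    print(first_alarm_step(run, min_rate=0.5))
```

In this sketch a run that keeps scoring well never triggers the alarm, which matches the point that good past performance guarantees nothing. A run whose score degrades does trigger it, which is the safety signal I have in mind.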