I’m particularly impressed by “The Floating Droid”. This can be seen as early-manifesting the foreseeable difficulty where:
At kiddie levels, a nascent AGI is not smart enough to model humans and compress its human feedback by the hypothesis “It’s what a human rates”, and so has object-level hypotheses about environmental features that directly cause good or bad ratings;
When smarter, an AGI forms the psychological hypothesis over its ratings, because that more sophisticated hypothesis is now available to its smarter self as a better way to compress the same data;
Then, being smart, the AGI goodharts a new option that pries apart the ‘spurious’ regularity (human psychology, what fools humans) from the ‘intended’ regularity the humans were trying to gesture at (what we think of as actually good or bad outcomes).
In this particular experiment, the small models did not have object-level hypotheses. They simply had no clue and answered randomly.
I think the experiment shows that sometimes smaller models are too dumb to pick up the misleading correlation that can throw off bigger models.