Short answer: The core focus of the “yet to be worked out techniques” is to figure out the “how do we get it to generalize properly” part, not the “how do we be super careful with the labels” part.
Longer answer: We can consider weak to strong generalization as actually two different subproblems:
1. generalizing from correct labels on some easy subset of the distribution (the 10,000 super careful, definitely 100% correct labels)
2. generalizing from labels which can be wrong and are more correct on easy problems than hard problems, where we don't exactly know when the labels are wrong (literally just normal human labels)
The setting in the paper doesn’t quite distinguish between the two, but I personally think the former problem is more interesting and contains the bulk of the difficulty. Namely, most of the difficulty is in understanding when generalization happens or fails, and which kinds of generalizations are more natural for the model.
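The distinction between the two subproblems can be made concrete with a toy sketch. Everything here (the even/odd task, the easy/hard threshold, the 70% accuracy figure) is illustrative and not from the paper; it just shows how the two supervision setups differ:

```python
import random

random.seed(0)

# Toy task: classify whether an integer is even.
# "Easy" inputs are small numbers; "hard" inputs are large numbers.
def true_label(x):
    return x % 2 == 0

def weak_label(x):
    # A weak supervisor that is reliable on easy inputs but noisy on hard ones.
    if x < 100:  # easy subset: always correct
        return true_label(x)
    # Hard subset: correct only ~70% of the time, and crucially we don't
    # get told which labels are the wrong ones.
    return true_label(x) if random.random() < 0.7 else not true_label(x)

inputs = [random.randrange(1000) for _ in range(2000)]

# Subproblem 1: supervise only on the easy subset, with guaranteed-correct
# labels, and hope the strong model generalizes to the hard subset.
setup_1 = [(x, true_label(x)) for x in inputs if x < 100]

# Subproblem 2: supervise on everything, with labels whose error rate
# grows on hard inputs in an unknown way.
setup_2 = [(x, weak_label(x)) for x in inputs]

# In setup 1 every training label is correct; in setup 2 some unknown
# fraction is wrong, concentrated on the hard inputs.
errors_1 = sum(lbl != true_label(x) for x, lbl in setup_1)
errors_2 = sum(lbl != true_label(x) for x, lbl in setup_2)
```

In setup 1 the entire difficulty lives in the generalization step (when does training on easy, clean labels transfer to hard inputs?), whereas setup 2 mixes that question with a label-noise problem.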