Max Harms comments on 0. CAST: Corrigibility as Singular Target

Max Harms 8 Jun 2024 16:23 UTC
3 points
I share your sense of doom around SGD! It seems to be the go-to method, there are no good guarantees about what sorts of agents it produces, and that seems really bad. Other researchers I’ve talked to, such as Seth Herd, share your perspective, I think. I want to emphasize that nothing in CAST per se depends on SGD, and I think corrigibility is still the most promising target in superior architectures.

That said, I disagree that corrigibility is more likely to “get attracted by things that are nearby but not it” compared to a Sovereign optimizing for something in the ballpark of CEV. I think hill-climbing methods are very naturally distracted by proxies for the real goal (e.g. eating sweet foods is a proxy for inclusive genetic fitness), but this applies equally to both targets, and is thus just as damning for training a CEV maximizer.

I’m not sure one can train an already goal-stabilized AGI (such as Survival-Bot, which just wants to live) into being corrigible post hoc, since it may simply learn that behaving/thinking corrigibly is the best way to shield its thoughts from being distorted by the training process (and thus to survive). Much of my hope in SGD routes through starting with a pseudo-agent which hasn’t yet settled on goals and which doesn’t have the intellectual ability to be instrumentally corrigible.