Sure, that all sounds reasonable to me; I think we’ve basically converged.
I don’t understand why we wouldn’t use that tech to directly shape the agent to want what we want it to want.
The main reason is that we don’t know ourselves what we want it to want, and we would instead like it to follow some process that we like (e.g. just do some scientific innovation and nothing else, help us do better philosophy to figure out what we want, etc). This sort of stuff seems like a poor fit for values-executors. Probably there will be some third, totally different mental architecture for such tasks, but if you forced me to choose between values-executors or grader-optimizers, I’d currently go with grader-optimizers.