Jozdien comments on Trying to isolate objectives: approaches toward high-level interpretability

Jozdien 13 Jan 2023 14:04 UTC
LW: 1 AF: 1
Do you think the default is that we’ll end up with a bunch of separate things that look like internalized objectives, so that the one used for planning can’t really be identified mechanistically as such? Or that only processes where objectives are really useful would learn them, and that there would be multiple of them? (Or some third thing?) In the second case, I think the same underlying idea still applies: figuring out all of them seems pretty useful.