Planning in particular probably involves outcome evaluation based on some abstract metric. If this is true, then wherever those metrics are stored in our brain's memory (or whatever plays that role) would be the analogue of what I'm picturing here.
Ah yeah, that makes sense for inference. Like if I'm planning some specific thing like "get a banana", maybe you can read my mind by monitoring my use of some banana-related neurons. But I view such a representation more as an intermediate step in the chain of motivation and planning, with the upshot that interpretability at this level has a hard time being used to actually intervene on what I want: I want the banana as part of some larger process, so rewiring the banana-neurons that were useful for inference might get routed around, or otherwise fail to have the intended effects. This also corresponds to a problem with trying to locate goals in the neocortex by (somehow) changing my "training objective" and seeing which parts of my brain change.
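To make the "monitoring banana-related neurons" picture a bit more concrete, here's a minimal sketch of reading out a single unit's activation with a PyTorch forward hook. The toy model and the `BANANA_UNIT` index are hypothetical stand-ins for whatever prior interpretability work identified; the point is that this reads an inference-level representation without telling you anything about the larger motivational chain it feeds into.

```python
# Minimal sketch of inference-level "mind reading", assuming a hypothetical
# unit index (BANANA_UNIT) already identified by some interpretability method.
import torch
import torch.nn as nn

# Toy stand-in for a real network.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
BANANA_UNIT = 7  # hypothetical: a unit correlated with banana-related plans

captured = {}

def hook(module, inputs, output):
    # Record this unit's activation on the current forward pass.
    captured["banana_activation"] = output[:, BANANA_UNIT].detach()

handle = model[1].register_forward_hook(hook)  # watch the hidden layer
_ = model(torch.randn(1, 16))
handle.remove()

print(captured["banana_activation"])
```

Note that this only observes the representation; ablating or rewiring that unit is exactly the kind of intervention that, on the view above, the surrounding planning process might route around.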
Oh yeah, I'm definitely not thinking explicitly about instrumental goals here; I expect those would be a lot harder to locate or identify mechanistically. I was picturing something more like the case where an optimizer is deceptive, for example, and needs to do the requisite planning, which would plausibly be centered on plans that best achieve its actual objective. Unlike instrumental objectives, there is a more compelling case that the actual objective is not just represented in pure thought-space, but is instead the source of the overarching chain of planning.