Jozdien comments on Trying to isolate objectives: approaches toward high-level interpretability

Jozdien 11 Jan 2023 14:10 UTC
1 point
Oh yeah, I’m definitely not thinking explicitly about instrumental goals here; I expect those would be a lot harder to locate or identify mechanistically. I was picturing something more like a situation where an optimizer is deceptive, for example, and has to do the requisite planning, which would plausibly be centered on plans that best achieve its actual objective. Unlike instrumental objectives, the actual objective seems to have a more compelling case for being the source of the overarching chain of planning, rather than just being represented in pure thought-space.