On testing, however, the retrained MB* did not show any visible inclination like this. In retrospect, that made sense: the idea relied on the assumption that the internal representation of the objective is bidirectional, i.e. that the parameter-reward mapping is roughly linear. A high-level update signal in one direction doesn't imply that the inverted signal produces the inverted behavior. This direction was a bust, but it was useful for surfacing incorrect implicit assumptions like this one.
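To make the nonlinearity point concrete, here is a toy numerical sketch (entirely my own example; the quadratic mapping, parameter values, and names are hypothetical and not taken from the MB* experiment). It shows that when the parameter-to-behaviour mapping is nonlinear, negating a parameter update need not negate, let alone mirror, its behavioural effect:

```python
# Toy illustration: with a nonlinear (here quadratic) parameter-to-behaviour
# mapping, opposite parameter steps can change behaviour in the same direction.

def behaviour_score(theta: float) -> float:
    # Hypothetical nonlinear mapping from a single parameter to some
    # behavioural tendency (e.g. "inclination toward the coin").
    return (theta - 1.0) ** 2

theta = 0.8   # current parameter value
step = 0.5    # a high-level update direction

forward = behaviour_score(theta + step) - behaviour_score(theta)
backward = behaviour_score(theta - step) - behaviour_score(theta)

print(f"+step changes behaviour by {forward:+.2f}")   # +0.05
print(f"-step changes behaviour by {backward:+.2f}")  # +0.45, not -0.05
```

Here both the update and its inversion increase the behaviour score, which is the failure mode the linearity assumption rules out by fiat.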
I think it’s improbable that agents internalize a single objective, but I applaud you for forming a concrete hypothesis and then going out to test it. I’m very excited about people trying to predict what algorithms a policy net will be running, thereby grounding out “mesa objectives” and other such talk in terms of falsifiable predictions about internal cognition and thus generalization behavior (e.g. going towards the coin, going towards the right, or something weirder than that).
Do you think the default is that we’ll end up with a bunch of separate things that look like internalized objectives, so that the one actually used for planning can’t really be identified mechanistically as such? Or that objectives only get learned by processes where they’re really useful, and there would still be multiple of them (or some third thing)? In the latter cases I think the same underlying idea still applies: figuring out all of them seems pretty useful.