I’m going to try to have a post that’s pretty similar to this out soon.
A main claim is that the thing you want to be doing (not just a general you, I mean specifically the vibe I get from you in this post) is to build an abstract model of the AI and use interpretability to connect that abstract model to the “micro-level” parameters of the AI. “Connect” means doing things like on-distribution inference of abstract model parameters from actual parameters, or translating a desired change in the abstract model into a method for updating the micro-level parameters. Neither of these directions of connection has to be done purely “internal to the AI,” without reference to the environment it’s in—it’s completely valid (and indeed necessary) to fit an agent into a model of its environment in the process of talking about its goals. Being able to talk about “the AI’s objectives” is the special case when you have an abstract model of the AI that features objectives as modeled objects. But using such a model isn’t the only way to make progress! We need to build our general capability to connect AIs’ parameters to any useful abstract model at all.
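To make the “inference” direction a bit more concrete, here’s a toy sketch of the kind of thing I mean (purely illustrative: the feature choices, the random-search fit, and every name in it are made up for this example, not a real method): treat the network as a black-box policy, posit an abstract model in which the agent scores outcomes with some goal weights, and fit those weights so the abstract model reproduces the network’s behavior on-distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Micro level: an opaque policy "network" over 4-dim states with 3 actions.
# (Here it's just a random 2-layer net standing in for the real parameters.)
W1 = rng.normal(size=(8, 4))
W2 = rng.normal(size=(3, 8))

def micro_policy(state):
    """Index of the action the micro-level network picks for a 4-dim state."""
    h = np.tanh(W1 @ state)
    return int(np.argmax(W2 @ h))

# Abstract level: "the agent wants outcomes scored by w . phi(state, action)".
def phi(state, action):
    """Hand-chosen outcome features the abstract model is allowed to talk about."""
    return np.array([state[0] * (action == 0),
                     state[1] * (action == 1),
                     state[2] * (action == 2)])

def abstract_policy(w, state):
    """What an agent with goal weights w would do, according to the abstract model."""
    return int(np.argmax([w @ phi(state, a) for a in range(3)]))

# "On-distribution inference": pick abstract goal weights w that best reproduce
# the micro-level behavior on states drawn from the (assumed) training distribution.
states = rng.normal(size=(200, 4))

def agreement(w):
    return np.mean([abstract_policy(w, s) == micro_policy(s) for s in states])

best_w, best_score = None, -1.0
for _ in range(300):  # crude random search over goal weights; fine for a toy
    w = rng.normal(size=3)
    score = agreement(w)
    if score > best_score:
        best_w, best_score = w, score

print("inferred goal weights:", best_w)
print("on-distribution agreement:", best_score)
```

Obviously a real version of this would have to infer the structure of the abstract model too, not just fit weights inside one I wrote down by hand, and the “connection” would run through the network’s internals rather than pure behavior.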
It may be the case that any model we actually have cause to worry about—that has the capacity to do long-term deceptive planning—requires mechanistic internalization of some objective which the mesa-optimizer can use faster than gradient descent can activate specific subroutines for this.
I don’t think we get off so easy. Even in humans, we’re pretty general and powerful despite not having some super-obvious locus of motivation outside the “intended” motivational system that trained our neocortex. It’s just that our capabilities have generalized faster than that motivational system has, and so we do things like invent Doritos even though on-distribution analysis of humans in the ancestral environment might have inferred that we want nutritious food.
A main claim is that the thing you want to be doing (not just a general you, I mean specifically the vibe I get from you in this post) is to build an abstract model of the AI and use interpretability to connect that abstract model to the “micro-level” parameters of the AI. “Connect” means doing things like on-distribution inference of abstract model parameters from actual parameters, or translating a desired change in the abstract model into a method for updating the micro-level parameters.
Yeah, this is broadly right. The mistake I was making earlier while working on this was thinking that my abstract model was good enough—I’ve since realized that this is the point of a large part of agent foundations work. It took doing this to realize that, though. This framing isn’t exactly how I was viewing it, but it seems pretty cool, so thanks!
Being able to talk about “the AI’s objectives” is the special case when you have an abstract model of the AI that features objectives as modeled objects. But using such a model isn’t the only way to make progress! We need to build our general capability to connect AIs’ parameters to any useful abstract model at all.
Oh yeah I agree—hence my last section on other cases where what we want (identifying the thing that drives the AI’s cognition) isn’t as clear-cut as an internalized object. But I think focusing on the case of identifying an AI’s objectives (or what we want from that) might be a good place to start, because everything else I can think of involves even more confused parts of the abstract model and a multitude of cases! Definitely agree that we need to build general capability; I expect there’s progress to be made from the direction of starting with complex abstract models that low-level interpretability would eventually scale up to.
Even in humans, we’re pretty general and powerful despite not having some super-obvious locus of motivation outside the “intended” motivational system that trained our neocortex.
(Disclaimer: this includes neurological conjectures on topics I’m far from familiar with.) I agree with the general point that this would plausibly end up being more complicated, but to explain my slight lean toward what I said in the post: I think whatever our locus of motivation is, it’s plausibly still represented somewhere in our brain—i.e., that there are explicit values/objectives driving a lot of our cognition, rather than just value-agnostic, contextually activated reactions. Planning in particular probably involves outcome evaluation based on some abstract metric. If this is true, then wherever those are stored in our brain’s memory/whatever would be analogous to what I’m picturing here.
Planning in particular probably involves outcome evaluation based on some abstract metric. If this is true, then wherever those are stored in our brain’s memory/whatever would be analogous to what I’m picturing here.
Ah yeah, that makes sense for inference. Like if I’m planning some specific thing like “get a banana”, maybe you can read my mind by monitoring my use of some banana-related neurons. But I view such a representation more as an intermediate step in the chain of motivation and planning, with the upshot that interpretability on this level has a hard time being used to actually intervene on what I want—I want the banana as part of some larger process, and so rewiring the banana-neurons that were useful for inference might get routed around or otherwise not have the intended effects. This also corresponds to a problem with trying to locate goals in the neocortex by (somehow) changing my “training objective” and seeing what parts of my brain change.
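Here’s a deliberately silly toy version of the gap I’m gesturing at between reading and rewiring (everything in it is made up for illustration, not a claim about brains or any particular architecture): a “plan variable” shows up in two places, a probe on the unit we happen to find reads it perfectly, but editing only that unit doesn’t change the downstream behavior.

```python
import numpy as np

def agent(desire, rewire_monitored=False):
    """desire is how much the agent 'wants a banana'; returns (monitored unit, action)."""
    monitored = 2.0 * desire     # a readable correlate of the plan (what the probe sees)
    workhorse = 2.0 * desire     # a redundant copy that actually drives the action
    if rewire_monitored:
        monitored = 0.0          # our "intervention" only hits the unit we located
    action = 1.0 if workhorse > 0 else 0.0   # downstream behavior routes around it
    return monitored, action

desires = np.linspace(-1.0, 1.0, 21)

# Inference: the monitored unit is perfectly informative on-distribution.
monitored_acts = np.array([agent(d)[0] for d in desires])
print("probe correlation with desire:", np.corrcoef(monitored_acts, desires)[0, 1])

# Intervention: rewiring that unit leaves the behavior untouched.
actions_before = np.array([agent(d)[1] for d in desires])
actions_after = np.array([agent(d, rewire_monitored=True)[1] for d in desires])
print("actions changed by rewiring:", int(np.sum(actions_before != actions_after)))
```

The point being that a representation can be great for reading off what I’m planning while still being the wrong handle for changing what I want.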
Oh yeah, I’m definitely not thinking explicitly about instrumental goals here; I expect those would be a lot harder to locate/identify mechanistically. I was picturing something more along the lines of a situation where an optimizer is deceptive, for example, and needs to do the requisite planning, which plausibly would be centered on plans that best achieve its actual objective. Unlike instrumental objectives, this seems to have a more compelling case for not just being represented in pure thought-space, but rather being the source of the overarching chain of planning.