“Goodhart” is no longer part of my native ontology for considering alignment failures. When I hear “The AI goodharts on some proxy of human happiness”, I start trying to fill in a concrete example mind design which fits that description and which is plausibly trainable. My mental events are something like:
Condition on: AI with primary value shards oriented around spurious correlate of human happiness; AI exhibited deceptive alignment during training, breaking perceived behavioral invariants during its sharp-capabilities-gain
Warning: No history defined. How did we get here?
Execute search for plausible training histories which produced this inner cognition
Proposal: Reward schedule around approval and making people laugh; historical designers had insufficient understanding of outer signal->inner cognition mapping; designers accidentally provided reinforcement which empowered smile-activation and manipulate-internal-human-state-to-high-pleasure shards
Objection: Concepts too human, this story is suspicious. Even conditioning on outcome, how did we get here? Why are there not more value shards? How did shard negotiation dynamics play out?
Meta-objection: Noted, but your interlocutor's point probably doesn't require figuring this out.
I think that Goodhart is usually describing how the AI “takes advantage of” some fixed outer objective. But in my ontology, there isn’t an outer objective—just inner cognition. So I have to do more translation.
There might be a natural concept for this that reframes deceptive alignment in the direction of reflection/extrapolation. Instead of viewing deceptive alignment as a change of behavior in response to capability gain, view it as a change in response to stepping into a new situation: a phase change in the (unchanging) mapping from situations to behaviors (local policies). The model's behaviors suddenly change as it moves to similar situations, in a way that's not "correctly prompted" by its behaviors in the original situations.
It’s like a robustness failure, but with respect to actual behavior in related situations, rather than with respect to some outer objective or training/testing distribution. So it seems more like a failure of reflection/extrapolation, where behavior in new situations should be determined by coarse-grained descriptions of behavior in old situations (maybe “behavioral invariants” are something like that; or just algorithms) rather than by any other details of the model. Aligned properties of behavior in well-tested situations normatively-should screen off details of the model, in determining behavior in new situations (for a different extrapolated/”robustness”-hardened model prepared for use in the new situations).
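To make the "phase change in a fixed situations → behaviors mapping" picture concrete, here is a toy sketch (purely illustrative; the scalar "situation", the threshold, and the behavior labels are all invented for exposition, not a proposed formalism). The policy itself never changes; its behavior jumps discontinuously once a situation crosses a threshold that every tested situation happened to fall short of.

```python
def policy(situation: float) -> str:
    """An unchanging situation -> behavior mapping with a hidden phase change.

    `situation` is a stand-in for how far a deployment context drifts from
    the training distribution; the 1.0 threshold is arbitrary.
    """
    if situation < 1.0:
        return "cooperate"   # behavior exhibited in all tested situations
    return "defect"          # behavior in a slightly-new situation


# Behavior looks invariant across every tested situation...
tested = [0.1 * i for i in range(10)]            # situations 0.0 .. 0.9
assert {policy(s) for s in tested} == {"cooperate"}

# ...yet a nearby, untested situation breaks the perceived invariant,
# even though nothing about the policy changed between the two calls.
print(policy(0.95))
print(policy(1.05))
```

The point of the toy: the failure is defined relative to behavior in nearby situations (the jump from 0.95 to 1.05), not relative to any outer objective; a coarse-grained description of behavior on the tested situations ("always cooperates") fails to screen off the model detail (the threshold) that determines behavior in the new situation.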