[Originally written for Twitter]
Many AI risk failure modes imagine strong coherence/goal-directedness (e.g. [expected] utility maximisers).
Such strong coherence is not exhibited by humans, seems unlikely to emerge from deep learning, and may be “anti-natural” to general intelligence in our universe.
I suspect the focus on strongly coherent systems was a mistake that set the field back a bit, and it has not yet fully recovered from that error.
I think most of the AI safety work for strongly coherent agents (e.g. decision theory) will end up inapplicable/useless for aligning powerful systems.
[I don’t think it nails everything, but on a purely ontological level, @Quintin Pope and @TurnTrout’s shard theory feels a lot more right to me than e.g. HRAD.
HRAD is based on an ontology that seems to me to be mistaken/flawed in important respects.]
The shard theory account of value formation (while incomplete) seems much more plausible as an account of how intelligent systems develop values (where values are “contextual influences on decision making”) than the immutable terminal goals of strong coherence ontologies. I currently believe that “immutable terminal goals” is simply the wrong frame for reasoning about generally intelligent systems in our world (e.g. humans, animals, and future powerful AI systems)[1].
And I get the impression that the assumption of strong coherence is still implicit in some current AI safety failure modes (e.g. it underpins deceptive alignment[2]).
I’d be interested in:

* more investigation into which environments/objective functions select for coherence, and to what degree that selection actually occurs;
* empirical demonstrations of systems that actually become more coherent as they are trained for longer, “scaled up”, or otherwise amplified;
* explanations from advocates of strong coherence of why agents operating in rich environments (e.g. animals, humans) and sophisticated ML systems (e.g. foundation models) aren’t strongly coherent;
* mechanistic interpretability analysis of sophisticated RL agents (e.g. AlphaStar, OpenAI Five [or replications thereof]) to investigate their degree of coherence (a toy sketch of one way to operationalise this follows below).
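As a purely illustrative aside (my own toy operationalisation, not something from the original discussion): one way to measure “degree of coherence” is to check whether an agent’s pairwise choices over a small outcome set admit a consistent ordering, i.e. contain no preference cycles. The outcome set, the context-dependent toy agent, and the contexts below are hypothetical stand-ins for queries to a real trained policy.

```python
import random
from itertools import combinations, permutations

# Hypothetical outcome set; a real study would use outcomes/trajectories that
# the trained agent under investigation can actually be asked to choose between.
OUTCOMES = ["dog", "diamond", "nap", "exploration"]


def toy_agent_prefers(a: str, b: str, context: str) -> bool:
    """Stand-in for querying a trained policy's revealed preference between a and b.

    Deliberately context-dependent (shard-like) rather than derived from one fixed
    utility function. Canonicalising the pair keeps the answer order-invariant,
    so prefers(a, b) == not prefers(b, a).
    """
    first, second = sorted((a, b))
    rng = random.Random(f"{first}|{second}|{context}")
    winner = first if rng.random() < 0.5 else second
    return winner == a


def coherence_score(context: str) -> float:
    """Fraction of outcome triples whose pairwise choices admit a strict ordering.

    A score of 1.0 means no preference cycles were found in this context
    (consistent with maximising *some* utility function over OUTCOMES);
    lower scores indicate money-pumpable incoherence.
    """
    triples = list(combinations(OUTCOMES, 3))
    transitive = sum(
        any(
            toy_agent_prefers(x, y, context)
            and toy_agent_prefers(y, z, context)
            and toy_agent_prefers(x, z, context)
            for x, y, z in permutations(triple)
        )
        for triple in triples
    )
    return transitive / len(triples)


if __name__ == "__main__":
    for ctx in ("hungry", "rested", "threatened"):
        print(ctx, round(coherence_score(ctx), 2))
```

Tracking a score like this across training checkpoints, model scales, or contexts would be one concrete version of the “do systems actually become more coherent as they are amplified?” question.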
Currently, I think strong coherence is unlikely (plausibly “anti-natural”[3]) and am unenthusiastic about research agendas and threat models predicated on strong coherence.
Disclaimer: this is all low-confidence speculation, and I may well be speaking out of my ass.
I do think my disagreement with deceptive alignment is not a failure of understanding, but I am very much an ML noob, so there may still be things I just don’t know. My opinion on this matter will probably be significantly different by this time next year.
[1] You cannot predict the behaviour/revealed preferences of humans or other animals well by assuming that they have fixed terminal goals.
The ontology on which intelligent systems in the real world instead have “values” (contextual influences on decision making) seems to explain their observed behaviour (and observed incoherencies) better.
[2] In addition to the other prerequisites listed in the “Deceptive Alignment” post, deceptive alignment also seems to require a mesa-optimiser so coherent that it would be especially resistant to modifications of its mesa-objective. That is, it requires very strong goal-content integrity.
[3] E.g. if the shard theory account of value formation is at all correct, particularly the following two claims:
* Values are inherently contextual influences on decision making
* Values (shards) are strengthened (or weakened) via reinforcement events
Then strong coherence in the vein of utility maximisation just seems like an anti-natural form for general intelligence (a toy contrast between the two ontologies is sketched below).
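To make the contrast concrete, here is a minimal, purely hypothetical sketch (my own toy model, not shard theory’s actual formalism): a “shard” as a context-gated influence on action selection whose strength changes with reinforcement events, next to a caricatured fixed-utility maximiser that ignores context entirely. All names, contexts, and numbers are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

ACTIONS = ["seek_food", "play", "flee"]


def fixed_utility_agent(context: str) -> str:
    """Strong-coherence caricature: one context-independent utility function."""
    # `context` is deliberately ignored: the same utility is maximised everywhere.
    utility = {"seek_food": 2.0, "play": 1.0, "flee": 0.5}
    return max(ACTIONS, key=lambda a: utility[a])


@dataclass
class Shard:
    """A contextual influence on decision making: only active in some contexts."""
    trigger: Callable[[str], bool]  # when does this shard activate?
    bids: Dict[str, float]          # how strongly it pushes each action
    strength: float = 1.0           # grows/shrinks with reinforcement events


@dataclass
class ShardAgent:
    shards: List[Shard] = field(default_factory=list)

    def act(self, context: str) -> str:
        # Each active shard adds its strength-weighted bids; no single
        # context-independent utility function is ever consulted.
        scores = {a: 0.0 for a in ACTIONS}
        for shard in self.shards:
            if shard.trigger(context):
                for action, bid in shard.bids.items():
                    scores[action] += shard.strength * bid
        return max(ACTIONS, key=lambda a: scores[a])

    def reinforce(self, context: str, reward: float) -> None:
        # Crude stand-in for "values (shards) are strengthened or weakened via
        # reinforcement events": shards active in the rewarded context gain strength.
        for shard in self.shards:
            if shard.trigger(context):
                shard.strength = max(0.0, shard.strength + 0.1 * reward)


if __name__ == "__main__":
    agent = ShardAgent(shards=[
        Shard(trigger=lambda c: "hungry" in c, bids={"seek_food": 2.0}),
        Shard(trigger=lambda c: "safe" in c, bids={"play": 1.5}),
        Shard(trigger=lambda c: "predator" in c, bids={"flee": 3.0}),
    ])
    for ctx in ("hungry and safe", "safe", "predator nearby"):
        print(ctx, "->", agent.act(ctx), "| fixed-utility agent:", fixed_utility_agent(ctx))
```

The point of the toy model is just that the shard agent’s behaviour is generated by contextually activated influences whose strengths drift with reinforcement, so there need not be any single utility function over world-states that it coherently maximises across contexts.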
I think this is broadly the right mindset for starting to think about the AGIs humans are likely to build (even if I disagree on the details), given the abject failure at doing alignment responsibly. But HRAD-style AGIs are still coming eventually, and even if the AGIs that humans build are directly aligned (don’t kill everyone themselves), they’ll eventually build those, and then the sorts of failure modes characteristic of them will become relevant. The MIRI style of alignment work might even indirectly save humanity from blunders by the first generations of aligned AGIs, by bringing this line of research to their attention and thus providing timely education about the kinds of misalignment danger that more powerful agentic designs might manifest.
Why do you think HRAD-style AGIs are coming eventually?
If you are an AGI, take misalignment concerns seriously, and don’t want the next generations of AGIs to merely have some chance of getting to care about your well-being by at least a very tiny fraction, you stop building AGIs with new designs or self-experimenting based on vague selection principles. But you still want strong optimization to make use of all the galaxies in the Hubble volume, before they get out of reach. So this time, you do it right.
I’m not actually convinced that strong coherence as envisaged in HRAD is a natural form of general intelligence in our universe.
Let me know if you want this to be turned into a top level post.
Seems useful. I think there are a set of important intuitions you’re gesturing at here around naturality, some of which I may share. I have some take (which may or may not be related) like:

> utility is a measuring stick which is pragmatically useful in certain situations, because it helps corral your shards (e.g. dogs and diamonds) into executing macro-level sensible plans (where you aren’t throwing away resources which could lead to more dogs and/or diamonds) and not just activating incoherently.
>
> but this doesn’t mean I instantly gain space-time-additive preferences about dogs and diamonds such that I use one utility function in all contexts, such that the utility function is furthermore over universe-histories (funny how I seem to care across Tegmark 4?).
From the post:

> Many observed values in humans and other mammals (see[4]) (e.g. fear, play/boredom, friendship/altruism, love, etc.) seem to be values that were instrumental for increasing inclusive genetic fitness (promoting survival, exploration, cooperation, and sexual reproduction/survival of progeny respectively). Yet humans and mammals seem to value these terminally, and not because of their instrumental value for inclusive genetic fitness.
>
> That the instrumentally convergent goals of evolution’s fitness criterion manifested as “terminal” values in mammals is IMO strong empirical evidence against the goals ontology and significant evidence in support of shard theory’s basic account of value formation.

Evolutionarily convergent terminal values are something that’s underrated, I think.