Free energy and (mis)alignment
The classical MIRI view imagines human values as a tiny squiggle in a vast space of alien minds. The unfathomable, inscrutable process of deep learning is very unlikely to pick exactly that tiny squiggle, and instead converges to a fundamentally incompatible and deeply alien one. Therein lies the road to doom.
Optimists will object that deep learning doesn’t randomly sample from the space of alien minds: it is put under strong gradient pressure to satisfy human preferences in-distribution, during the training phase. One could similarly object, and many people have [cf. symbol-grounding talk], that it’s hard or even impossible for deep learning systems to learn concepts that aren’t naive extrapolations of their training data. In fact, Claude is quite able to verbalize human ethics and values.
Any given behaviour and performance on the training set is compatible with any given behaviour outside the training set. One can hardcode backdoors into a neural network so that it behaves nicely during training and arbitrarily differently outside training. Moreover, these backdoors can be implemented in such a way that they are computationally intractable to detect. In other words, AIs would be capable of encrypting their thoughts (‘steganography’) and of arbitrarily malevolent, ingenious scheming in a way that is computationally and physically impossible to detect.
Possible does not mean plausible. That arbitrarily undetectable scheming AIs are possible doesn’t mean they will actually arise. In other words, alignment is really about the likelihood of sampling different kinds of AI minds. MIRI says it’s a bit like picking a tiny squiggle from a vast space of alien minds. Optimists think AIs will be aligned by default because they have been trained to be.
The key insight of free energy decomposition is that any process of selection or learning involves two opposing forces. First, there’s an “entropic” force that pushes toward random sampling from all possibilities—like how a gas naturally spreads to fill a room. Second, there’s an “energetic” force that favors certain outcomes based on some criterion—like how gravity pulls objects downward. In AI alignment, the entropic force pulls toward sampling random minds from the vast space of possible minds, while the energetic force (from training) pulls toward minds that behave as we want. The actual outcome depends on which force is stronger. This same pattern shows up across physics (free energy), statistics (the complexity-accuracy tradeoff), machine learning (regularization vs. fit), Bayesian statistics (Watanabe’s free energy formula), and algorithmic information theory (minimum description length).
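To spell out what I mean by the decomposition, here is a minimal sketch in the standard Gibbs/Boltzmann form (nothing here is alignment-specific; the region A is whatever subset of outcome space you care about, e.g. aligned minds):

```latex
% Outcomes w are weighted by an "energy" (loss) E(w) at inverse temperature \beta:
p(w) = \frac{e^{-\beta E(w)}}{Z}, \qquad Z = \int e^{-\beta E(w)}\, dw .

% The probability of landing in a region A is governed by its free energy,
% which splits into an energetic (accuracy) and an entropic (volume) term:
P(A) = \frac{Z_A}{Z} = e^{-\beta (F_A - F)}, \qquad
F_A = \langle E \rangle_A - T\, S_A, \quad T = 1/\beta .

% Low-energy regions (good fit to the criterion) and high-entropy regions
% (large, many ways to implement) are both favoured; the outcome is the tradeoff.
```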
In short, (mis)alignment is about the free energy of human values in the vast space of alien minds. How general is the free-energy decomposition in this sense? There are situations where the relevant distribution is not a Boltzmann distribution (SGD in the high-noise regime), but in many cases it is (Bayesian statistics, and approximately SGD in the low-noise regime...), and then we can describe the likelihood of any outcome in terms of a free energy tradeoff.
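As a toy numerical illustration of that tradeoff (purely illustrative; the “mind space”, the loss numbers, and the region sizes are invented for the example):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "mind space" (all numbers made up): a tiny region of minds that fit the
# training signal slightly better, and a vast region that fits slightly worse.
n_aligned, n_alien = 10, 1_000_000
loss_aligned = rng.normal(1.00, 0.01, n_aligned)  # low energy, tiny volume
loss_alien   = rng.normal(1.20, 0.01, n_alien)    # higher energy, huge volume

def p_aligned(beta: float) -> float:
    """Probability mass of the small low-loss region under a Boltzmann
    distribution p(mind) proportional to exp(-beta * loss(mind))."""
    w_small = np.exp(-beta * loss_aligned).sum()
    w_vast = np.exp(-beta * loss_alien).sum()
    return w_small / (w_small + w_vast)

# Weak selection pressure (low beta): entropy wins, the vast region dominates.
# Strong selection pressure (high beta): energy wins, the low-loss region dominates.
for beta in [1.0, 10.0, 50.0, 100.0]:
    print(f"beta = {beta:5.1f}   P(small low-loss region) = {p_aligned(beta):.3g}")
```

At low beta the vast region wins almost surely; at high beta the small low-loss region wins. The whole disagreement is about which regime real training is in.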
Doomers think the entropic effect, sampling a random alien mind from gargantuan mindspace, dominates; optimists think the ‘energetic’ effect wins out, so that training on observed actions also constrains private thought and out-of-distribution action. Ultra-optimists even believe that large parts of mindspace are intrinsically friendly; that there are basins of docility; that the ‘entropic’ effect is good, actually; that the arc of the universe is long but bends towards kindness.
The MIRI view, I’m pretty sure, is that the force of training does not pull towards minds that behave as we want, unless we know a lot of things about training design we currently don’t.
MIRI is not talking about randomness in the sense of the spread of the training posterior as a function of Bayesian sampling noise / NN initialization / SGD noise. The point isn’t that training is inherently random. It could be a completely deterministic process without affecting the MIRI argument at all. If everything were a Bayesian sample from the posterior and there were a single basin of minimum local learning coefficient corresponding to equivalent implementations of a single algorithm, I don’t think this would by default make models any more likely to be aligned. The simplest fit to the training signal need not be an optimiser pointed at a terminal goal that maps to the training signal in a neat way humans can intuitively zero-shot without figuring out the underlying laws. The issue isn’t that the terminal goals are somehow fundamentally random, i.e. that there is no clear one-to-one mapping from the training setup to the terminal goals. It’s that we early-21st-century humans don’t know the mapping from training setup to terminal goals. Having the terminal goals be completely determined by the training criteria does not help us if we don’t know which training criteria map to terminal goals that we would like. It’s a random draw from a vast space from our[1] perspective, because we don’t know what we’re doing yet.
Probability and randomness are in the mind, not the territory. MIRI is not alleging that neural network training is somehow bound to strongly couple to quantum noise.
I’m not following exactly what you are saying here, so I might be collapsing some subtle point. Let me preface this by saying it’s a shortform, half-baked by design, so you might be completely right that it’s confused.
Let me try and explain myself again.
I probably have confused readers by using the free energy terminology. What I mean is that in many cases (perhaps all) the probabilistic outcome of any process can be described in terms of a competition between simplicity (entropy) and accuracy (energy) with respect to some loss function.
Indeed, the simplest fit for a training signal might not be aligned. In some cases, perhaps almost all fits for a training signal create an agent whose values are only somewhat constrained by the training signal and otherwise randomly sampled, conditional on doing well on the training signal. The “good” values might be only a small part of this subspace.
Perhaps you and Dmitry are saying the issue is not just a simplicity-accuracy / entropy-energy split, but also that the training signal is not perfectly “sampled from true goodly human values”. There would be another error coming from this incongruence?
Hope you can enlighten me.
I was excited by the first half, seeing you relate classic Agent Foundations thinking to current NN training regimes, and try to relate the optimist/pessimist viewpoints.
Then we hit free energy and entropy. These seem like needlessly complex metaphors, providing no strong insight on the strength of the factors pushing toward and pulling away from alignment.
Analyzing those “forces” or tendencies seems like it’s crucially important, but needs to go deeper than a metaphor or use a much more fitting metaphor to get traction.
Nonetheless, upvoted for working on the important stuff even when it’s hard!
I probably shouldn’t have used the free energy terminology. Does “complexity-accuracy tradeoff” work better?
To be clear, I very much don’t mean these things as a metaphor. I am thinking there may be an actual numerical complexity-accuracy tradeoff, some elaboration of Watanabe’s “free energy” formula, that actually describes these tendencies.
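For reference, the formula I have in mind is Watanabe’s asymptotic expansion of the Bayesian free energy (roughly stated; whether some elaboration of it applies to the training setups we actually care about is exactly the open question):

```latex
% Bayesian free energy for n samples, prior \varphi, averaged empirical loss L_n:
F_n = -\log \int \exp\big(-n L_n(w)\big)\, \varphi(w)\, dw
    = n L_n(w_0) + \lambda \log n - (m-1) \log\log n + O_p(1),
% where w_0 is an optimal parameter, \lambda the real log canonical threshold
% (learning coefficient) and m its multiplicity.
% The n L_n(w_0) term is the "accuracy" piece; \lambda \log n is the "complexity" piece.
```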
I’m not sure I agree with this—this seems like you’re claiming that misalignment is likely to happen through random diffusion. But I think most worries about misalignment are more about correlated issues, where the training signal consistently disincentivizes being aligned in a subtle way (e.g. a stock trading algorithm manipulating the market unethically because the pressure of optimizing income at any cost diverges from the pressure of doing what its creators would want it to do). If diffusion were the issue, it would also affect humans and not be special to AIs. And while humans do experience value drift, cultural differences, etc., I think we generally abstract these issues as “easier” than the “objective-driven” forms of misalignment.
I agree that Goodharting is an issue, and this has been discussed as a failure mode, but a lot of AI risk writing definitely assumed that something like random diffusion was a non-trivial component of how AI alignment failures happened.
For example, pretty much all of the reasoning around random programs being misaligned/bad is using the random diffusion argument.
The free energy talk probably confuses more than it elucidates. I’m not talking about random diffusion per se, but about the connection between uniform sampling, simplicity, and the simplicity-accuracy tradeoff.
I’ve tried explaining more carefully where my thinking currently is in my reply to Lucius.
Also caveat that shortforms are halfbaked-by-design.
Yep, have been recently posting shortforms (as per your recommendation), and totally with you on the “halfbaked-by-design” concept (if Cheeseboard can do it, it must be a good idea right? :)
I still don’t agree that free energy is core here. I think that the relevant question, which can be formulated without free energy, is whether various “simplicity/generality” priors push towards or away from human values (and you can then specialize to questions of effective dimension/llc, deep vs. shallow networks, ICL vs. weight learning, generalized ood generalization measurements, and so on to operationalize the inductive prior better). I don’t think there’s a consensus on whether generality is “good” or “bad”—I know Paul Christiano and ARC have gone both ways on this at various points.
I think simplicity/generality priors effectively have zero effect on whether the model is pushed towards or away from human values, and are IMO kind of orthogonal to alignment-relevant questions.
I’d be curious how you would describe the core problem of alignment.
I’d split it into inner alignment, which is how we manage to instill any goal/value that is ideally at least somewhat stable, and outer alignment, which is selecting a goal that is resistant to Goodharting.
Let’s focus on inner alignment. By “instill” you presumably mean “train”. What values get trained is ultimately a learning problem, which in many cases (as long as one can approximately formulate a Boltzmann distribution) comes down to a simplicity-accuracy tradeoff.
Could you give some examples of what you are thinking of here?
You mean on more general algorithms being good vs. bad?
Yes.
I haven’t thought about this enough to have a very mature opinion. On one hand, being more general means you’re liable to Goodhart more (i.e., with enough deeply general processing power, you understand that manipulating the market to start World War 3 will make your stock portfolio grow, so you act misaligned). On the other hand, being less general means that AIs are more liable to “partially memorize” how to act aligned in familiar situations, and go off the rails when sufficiently out-of-distribution situations are encountered. I think this is related to the question of “how general are humans”, and how stable human values are to being much more or much less general.
I guess I’m mostly thinking about the regime where AIs are more capable and general than humans.
It seems at first glance that the latter failure mode is more of a capability failure, something one would expect to go away as AI truly surpasses humans. It doesn’t seem core to the alignment problem to me.
Maybe a reductive summary is “general is good if outer alignment is easy but inner alignment is hard, but bad in the opposite case”
Isn’t it the other way around?
If inner alignment is hard, then general is bad, because applying less selection pressure (i.e. more generality, more simplicity prior) means more daemons/gremlins.