AI Safety person currently working on multi-agent coordination problems.
Jonas Hallgren
Epistemic status: Curious.
TL;DR:
The Sharp Left Turn, and specifically how far alignment generalizes, is highly dependent on how much slack you allow between each optimisation epoch. By minimizing the slack you allow in the “utility boundary” (the part of the local landscape that is counted as part of you when trying to describe a utility function of the system) you can minimize the expected divergence of the optimization process and therefore minimize the alignment-capability gap?

A bit longer, from the better half of me (Claude):
Your analysis of the sharp left turn made me reflect on an interesting angle regarding optimization processes and coordination. I’d like to share a framework that builds on but extends beyond your discussion of the (1-3) triad and capabilities generalization:
I believe we can understand sharp left turns through the lens of ‘optimization slack’ - the degree of freedom an optimization process has between correction points. Consider:
Modern ML systems use gradient descent with tight feedback loops and minimal slack
Evolution operated with enormous slack between meaningful corrections
Cultural evolution introduced intermediate coordination mechanisms through shared values
This connects to your discussion of autonomous learning and discernment, but examines it through a different lens. When you describe how ‘capabilities generalize further than alignment,’ I wonder if the key variable isn’t just the generalization itself, but how much slack we permit in the system before correction.
A concrete model I’ve been working with looks at this as boundaries in optimization space [see figure below].
Here are some pictures from a comment that I left on one of the posts [figures not reproduced here]:
Which can be perceived as something like this in the environmental sense [figure not reproduced here]:
The hypothesis is that by carefully constraining the “utility boundary” (the region within which a system can optimize while still being considered aligned) we might better control divergence.
I’m curious whether you see this framework as complementary to or in tension with your analysis of the (1-3) triad. Does thinking about alignment in terms of permitted optimization slack add useful nuance to the capabilities vs alignment generalization debate?
Here’s a fun way I sometimes evaluate my own actions:
I was just going outside and I caught myself thinking “Am I acting the best way that I can given that the simulation hypothesis was true and I was the average sample that is spread throughout simulation space due to anthropic reasoning?”
From this I conclude that I should probably have fun, because it is good for fun to be spread, and that I should probably work on something important that helps the world beyond just having fun.
Depending on the subjective probability of anthropic reasoning plus the simulation hypothesis, the balance between individual versus collective optimisation shifts, which is quite fun.
It’s almost like the categorical imperative but instead of acting in a way to optimise society, I’m acting in a way to optimise society given a multiverse hypothesis, kinda weird but the brain do be braining sometimes.
So I find the question underspecified: why do you want this?
Why are you decomposing body signalling without looking at the major sub-regulatory systems? If you want to predict sleep then cortisol, melatonin, etc. are quite good, and this will tell you about stress regulation, which affects both the endocrine and cortisol systems.
If you want to look at nutritional systems then GLP-1 activation is a good proxy for average food need, whilst ghrelin is predictive of whether you will feel hungry at specific times.
If you’re looking at brain health then serotonin activation patterns can be really good to check, but the brain’s use of serotonin is different from the gut’s, and the gut holds the majority of the body’s serotonin. Even this is way too simplified, especially for the brain.
Different subsystems use the same molecules in different ways (waste not and all that), so what are you looking for, and why?
Okay, that makes sense to me so thank you for explaining!
I guess what I was pointing at with the language thing is the question of what the actual underlying objects you called XYZ are, and how they relate to the linguistic view of language as contextually dependent symbols defined by many scenarios rather than by some sort of logic.
Like, if we use IB (infrabayesianism), it might be easy to look at that as a probability distribution over probability distributions? I just thought it was interesting to get some more context on how language might help in an alignment plan.
Those are some great points, made me think of some more questions.
Any thoughts on what language “understood vs not understood” might be expressed in? ARC’s heuristic arguments, or something like infrabayesianism? What is the type signature of this, and how does it relate to what you wrote in the post? Also, what is its relation to natural language?
This makes a lot of sense to me. For some reason it reminds me of some of Stuart Armstrong’s OOD-generalization work on alternative safeguarding strategies for imperfect value extrapolation? I can’t find a good link though.
I also thought it would be interesting to mention the link to the idea in linguistics that a word is specified by all the different contexts it appears in, so that a symbol is a probability distribution over contextual meanings. From the perspective of this post, wouldn’t natural language then work a bit like a redundancy specifier, and so LLMs would be more alignable than RL agents? (I don’t think I’m making a novel argument here, I just thought it would be interesting to point out.)
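To gesture at what I mean with a toy formalisation (my own, not anything from the post): if a word $w$ is specified by the contexts $c$ it appears in, its meaning looks like a mixture over contexts, $p(m \mid w) = \sum_{c} p(m \mid w, c)\, p(c \mid w)$, so every symbol carries a lot of redundant, overlapping contextual information, which is the redundancy-specifier intuition above.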
So I still haven’t really figured out how to talk about these things properly as it is more of a vibe than it is an intellectual truth?
Let’s say that you don’t feel a strong sense of self but are instead identified with nothing: there is no self, and if you see this then you can see the “deathless”.
It’s pointing out a different metaphysical viewpoint that can be experienced. I agree with you that from a rational point of view this is strictly not true, yet it isn’t to be understood, it is to be experienced? You can’t, or at least I can’t, think my way to it.
I’ve got a bunch of meditation under my belt so my metacognitive awareness is quite good imo.
Attention-increasing stimulants such as caffeine or modafinil generally lead to more tunnel vision and less metacognitive awareness in my experience. This generally leads to less ability to update opinions quickly.
Nicotine, which activates acetylcholine receptors, allows for more curiosity, which lets me update more quickly, so it depends on the stimulant as well as the general timing (0.6 mg in gum form; too high a spike just leads to a hit and not curiosity). It is like being more sensitive to and interested in whatever appears around me.
If you’re sensitive enough you can start recognizing when different mental modes are firing in your brain and adapt based on what you want, shit is pretty cool.
One of the more common responses I hear at this point is some variation of “general intelligence isn’t A Thing, people just learn a giant pile of specialized heuristics via iteration and memetic spread”.
I’m very uncertain about the validity of the question below, but I shall ask it anyway, and since I don’t trust my own way of expressing it, here’s Claude on it:
The post argues that humans must have some general intelligence capability beyond just learning specialized heuristics, based on efficiency arguments in high-dimensional environments. However, research on cultural evolution (e.g., “The Secret of Our Success”, “Cognitive Gadgets”) suggests that much of human capability comes from distributed cultural learning and adaptation. Couldn’t this cultural scaffolding, combined with domain-specific inductive biases (as suggested by work in Geometric Deep Learning), provide the efficiency gains you attribute to general intelligence? In other words, perhaps the efficiency comes not from individual general intelligence, but from the collective accumulation and transmission of specialized cognitive tools?
I do agree that there are specific generalised forms of intelligence; I guess this points me more towards the idea that the generating functions of these might not be optimally sub-divided in the usual way we think about them?

Now, completely theoretically of course, say someone were to believe the above, why is the following really stupid?:
Specifically, consider the following proposal: Instead of trying to directly align individual agents’ objectives, we could focus on creating environmental conditions and incentive structures that naturally promote collaborative behavior. The idea being that just as virtue ethics suggests developing good character through practiced habits and environmental shaping, we might achieve alignment through carefully designed collective dynamics that encourage beneficial emergent behaviors. (Since this seems to be the most agentic underlying process that we currently have, theoretically of course.)
Well said. I think that research fleets will be a big thing going forward and you expressed why quite well.
I think there’s an extension that we also have to make with some of the safety work we have, especially for control and related agendas. It is to some extent about aligning research fleets and not individual agents.
I’ve been researching ways of going about aligning and setting up these sorts of systems for the last year, but I find myself very bottlenecked by not being able to communicate the theories that exist in related fields that well.
It is quite likely that RSI happens in lab automation and distributed labs before anything else. So the question then becomes how one can extend the existing techniques and theory that we currently have to distributed systems of research agents?
There’s a bunch of fun and very interesting decentralised coordination schemes and technologies one can use from fields such as digital democracy and collective intelligence. It is just really hard to prune what will work and to think about what the alignment proposals for these things should be. You usually get emergence in Agent-Based Models, which research systems are a sub-class of, and often the best way to predict problems is to actually run experiments in those systems.
So how in the hell are we supposed to predict the problems without this? What are the experiments we need to run? What types of organisation & control systems should be recommended to governance people when it comes to research fleets?
This delightful piece applies thermodynamic principles to ethics in a way I haven’t seen before. By framing the classic “Ones Who Walk Away from Omelas” through free energy minimization, the author gives us a fresh mathematical lens for examining value trade-offs and population ethics.
What makes this post special isn’t just its technical contribution—though modeling ethical temperature as a parameter for equality vs total wellbeing is quite clever. The phase diagram showing different “walk away” regions bridges the gap between mathematical precision and moral intuition in an elegant way.
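For readers who haven’t seen the post, the rough shape of the trick as I read it (my paraphrase, not necessarily the exact functional form used there) is the standard thermodynamic trade-off $F = U - TS$, where the energy-like term $U$ is something like aggregate suffering (low energy means high total wellbeing), the entropy-like term $S$ measures how evenly wellbeing is spread across the population, and the “ethical temperature” $T$ sets how strongly equality is weighted against total wellbeing when $F$ is minimised.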
While I don’t think we’ll be using ethodynamics to make real-world policy decisions anytime soon, this kind of playful-yet-rigorous exploration helps build better abstractions for thinking about ethics. It’s the kind of creative modeling that could inspire novel approaches to value learning and population ethics.
Also, it’s just a super fun read. A great quote from the conclusion is “I have striven to make this paper a pleasant read by enriching it with all manners of enjoyable things: wit, calculus, and a non indifferent amount of imaginary child abuse”.
That is the type of writing I want to see more of! Very nice.
I really like this! For me it somewhat also paints a vision for what could be which might inspire action.
Something that I’ve generally thought would be really nice to have over the last couple of years is a vision for what a decentralized AI Safety field could look like and what the specific levers to pull would be to get there.
What does the optimal form of a decentralized AI Safety science look like? How does this incorporate parts of metascience and potentially decentralized science?
What does this look like with literature reviews from AI systems? How can we use AI systems themselves to create such infrastructure in the field? What do such communication pathways optimally look like?
I feel that there is so much low-hanging fruit here. There are so many algorithms that we could apply to make things better. Yes, we’ve got some forums, but holy smokes could the underlying distribution and optimisation systems be optimised. Maybe the Lightcone crew could cook something in this direction?
Let me drop some examples of “theory”, or at least useful bits of information, that I find interesting beyond the morphogenesis and free energy principle vibing. I agree with you that the basic form of the FEP is just another formalization of Bayesian message passing, formalised through KL-divergence, and whilst interesting it doesn’t say that much about foundations. For Artificial Life it is more of a vibe check from having talked to people in the space; it seems to me they’ve got a bunch of thoughts about it, but it also seems like they’ve got some academic capture, so it might be useful to at least talk to the researchers there about your work?
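For concreteness on the “just KL-divergence” point, the variational free energy the FEP minimises is the standard quantity (nothing specific to your setting): $F(q) = \mathbb{E}_{q(s)}[\ln q(s) - \ln p(o, s)] = D_{\mathrm{KL}}(q(s) \,\|\, p(s \mid o)) - \ln p(o)$, i.e. approximate Bayesian inference dressed up in thermodynamic language, which is why I agree it doesn’t by itself say much about foundations.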
Like, a perhaps insultingly simple suggestion: do a quick literature review through Elicit in Active Inference and computational biology for your open questions and see if there are links; if there are, send those people a quick message. I think a bunch of the theory is in people’s heads, and if you nerd-snipe them they’re usually happy to give you the time of day.
Here’s some stuff that I think is theoretically cool as a quick sampler:
For Levin’s work:
In the link I posted above he talks about morphogenesis. The thing I find most interesting there, from an agent foundations and information processing perspective, is the anti-fragility of these systems with respect to information loss (similar to some of the stuff in Uri’s work, if I’ve understood that correctly). There are lots of variations in the underlying genetics, yet similar structures can be decoded through similar algorithms, which shows a huge resilience there. It seems you probably know this from Uri’s work already.
Active Inference stuff:
Physics as information processing: https://youtu.be/RpOrRw4EhTo
The reason why I find this very interesting is that it seems to me to be saying something fundamental about information processing systems from a limited observer perspective.
I haven’t gotten through the entire series yet but it is like a derivation of hierarchical agency or at least why a controller is needed from first principles.
I think this ACS post explains it better than I do below but here’s my attempt at it:
I’m trying to find the stuff I’ve seen on <<Boundaries>> within Active Inference, yet it is spread out and not really centralised. There’s this very interesting perspective that there is only model and modelled, and that talking about agent foundations is a bit like taking the modeller as the foundational perspective, whilst that is a model in itself. Some kind of computational intractability claim, together with the above video series, gets you to a place where we have a system of hierarchical agents and controllers in a system with each other. I have a hard time explaining it, but it points towards a fundamental symmetry between an agent and its environment.
Other videos from Levin’s channel:
Agency at the very bottom—some category theory mathy stuff on agents and their fundamental properties: https://youtu.be/1tT0pFAE36c
The Collective Intelligence of Morphogenesis—if I remember correctly it goes through some theories around cognition of cells, there’s stuff about memory, cognitive lightcones etc. I at least found it interesting: https://youtu.be/JAQFO4g7UY8
(I’ve got that book from Uri on my reading list btw; it reminded me of this book on categorical systems theory, might be interesting: http://davidjaz.com/Papers/DynamicalBook.pdf)
In your MATS training program from two years ago, you talked about farming bits of information from real world examples before doing anything else as a fast way to get feedback. You then extended this to say that this is quicker than doing it with something like running experiments.
My question is then why you haven’t engaged your natural latents (or what in my head I think of as a “boundary formulation through a functional”) with fields such as artificial life or computational biology, where these are core questions to answer?
Trying to solve image generation or trying to solve something like fluid mechanics simulations seem a bit like doing the experiment before trying to integrate it with the theory in that field? Wouldn’t it make more sense to try to engage in a deeper way with the existing agent foundations theory in the real world like Michael Levin’s Morphogenesis stuff? Or something like an overview of Artificial Life?
Yes, as you say, real-world feedback loops and working on real-world problems; I fully agree, but are you sure that you’re done with the problem-space exploration? These fields already have a bunch of bits on crossing the theory-practice gap. You’re trying to cross it by applying the theory in practice, yet if that’s the hardest part, wouldn’t it make sense to sample from a place that has already done that?
If I’m wrong here, I should probably change my approach so I appreciate any insight you might have.
I love your stuff and I’m very excited to see where you go next.
I would be very curious to hear what you have to say about more multi-polar threat scenarios and extending theories of agency into the collective intelligence frame. What are your takes on Michael Levin’s work on agency and “morphogenesis” in relation to your neuroscience ideas? What do you think about claims of hierarchical extension of these models? How does this affect multipolar threat models? What are the fundamental processes that we should care about? When should we expand these concepts cognitively, and when should we constrain them?
I resonate with this framing of evolution as an optimizer and I think we can extend this perspective even further.
Evolution optimizes for genetic fitness, yes. But simultaneously, cultural systems optimize for memetic fitness, markets optimize for economic fitness, and technological systems increasingly optimize for their own forms of fitness. Each layer creates selection pressures that ripple through the others in complex feedback loops. It isn’t that evolution is the only thing happening; it may be the outermost value function that exists, but there’s so much nesting here as well.
There’s only modelling and what is being modelled and these things are happening everywhere all at once. I feel like I fully agree with what you said but I guess for me an interesting point is about what basis to look at it from.
Randomly read this comment and I really enjoyed it. Turn it into a post? (I understand how annoying structuring complex thoughts coherently can be, but maybe do a dialogue or something? I liked this.)
I largely agree with a lot of what you say is missing from people’s views of utility functions, and I think you expressed some of that in a pretty good, deeper way.
When we get into acausality and Everett branches I think we’re going a bit off-track. I think computational intractability and observer bias are interesting to bring up, but I always find it never leads anywhere. Quantum mechanics is fundamentally observer-invariant, and so positing something like MWI is a philosophical stance (one supported by Occam’s razor), but it is still observer-dependent. What if there are no observers?
(Pointing at Physics as Information Processing)
Do you have any specific reason why you’re going into QMech when talking about brain-like AGI stuff?
Most of the time, the most high value conversations aren’t fully spontaneous for me but they’re rather on open questions that I’ve already prepped beforehand. They can still be very casual, it is just that I’m gathering info in the background.
I usually check out the papers submitted, or the participants if it’s based on Swapcard, and do some research beforehand on what people I want to meet. Then I usually have some good opener that leads to some interesting conversations. These conversations can be very casual and can span wide areas, but I feel I’m building a relationship with an interesting individual, and that’s really the main benefit for me.
At the latest ICML, I talked to a bunch of interesting multi-agent researchers through this method and I now have people I can ask stupid questions.
I also always come to conferences with one or more specific projects that I want advice on which makes these conversations a lot easier to have.
Extremely long chain of thought, no?
Thank you for being patient with me, I tend to live in my own head a bit with these things :/ Let me know if this explanation is clearer using the examples you gave:
Let me build on the discussion about optimization slack and sharp left turns by exploring a concrete example that illustrates the key dynamics at play.
Think about the difference between TD-learning and Monte Carlo methods in reinforcement learning. In TD-learning, we update our value estimates frequently based on small temporal differences between successive states. The “slack”—how far we let the system explore/optimize between validation checks—is quite tight. In contrast, Monte Carlo methods wait until the end of an episode to make updates, allowing much more slack in the intermediate steps.
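To make the slack difference concrete, here’s a minimal sketch (a toy 5-state random-walk chain; the constants and names are mine, purely illustrative): TD(0) corrects the value estimate after every single transition, while the Monte Carlo version waits for the whole episode before correcting anything.

```python
import random

# Toy 5-state random-walk chain (states 0..4; 0 and 4 are terminal).
# "Slack" here is the number of environment steps taken between corrections
# of the value estimate: one step for TD(0), a whole episode for Monte Carlo.
N_STATES, GAMMA, ALPHA = 5, 0.99, 0.1

def random_episode():
    """Random walk from the middle state; reward 1 only for reaching the right end."""
    s, traj = N_STATES // 2, []
    while 0 < s < N_STATES - 1:
        s_next = s + random.choice([-1, 1])
        reward = 1.0 if s_next == N_STATES - 1 else 0.0
        traj.append((s, reward, s_next))
        s = s_next
    return traj

def td0_update(V, traj):
    # Tight slack: correct after every transition, bootstrapping off V[s_next].
    for s, r, s_next in traj:
        V[s] += ALPHA * (r + GAMMA * V[s_next] - V[s])

def monte_carlo_update(V, traj):
    # Loose slack: wait for the full episode, then correct towards the observed return.
    G = 0.0
    for s, r, _ in reversed(traj):
        G = r + GAMMA * G
        V[s] += ALPHA * (G - V[s])

V_td, V_mc = [0.0] * N_STATES, [0.0] * N_STATES
for _ in range(1000):
    episode = random_episode()
    td0_update(V_td, episode)
    monte_carlo_update(V_mc, episode)
print("TD(0):       ", [round(v, 2) for v in V_td])
print("Monte Carlo: ", [round(v, 2) for v in V_mc])
```

Both converge on this toy problem; the point is only where the correction happens, per transition versus per episode.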
This difference provides insight into the sharp left turn problem. When we allow more slack between optimization steps (like in Monte Carlo methods), the system has more freedom to drift from its original utility function before course correction. The divergence compounds particularly when we have nested optimization processes—imagine a base model with significant slack that then has additional optimization layers built on top, each with their own slack. The total divergence potential multiplies.
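A toy way to state the compounding (my own gloss, nothing rigorous): if a layer can drift by roughly a factor of $(1+\delta)$ per step and runs $k$ steps between corrections, the drift inside one correction window is about $(1+\delta)^k$, and stacking $L$ such layers gives something like $D_{\text{total}} \approx \prod_{l=1}^{L} (1+\delta_l)^{k_l}$, so slack at different levels multiplies rather than adds.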
This connects directly to your point about GPT-style models versus AlphaZero. While the gradient descent itself may have tight feedback loops, the higher-level optimization occurring through prompt engineering or fine-tuning introduces additional slack. It’s similar to how cultural evolution, with its long periods between meaningful corrections, allowed for the emergence of inner optimizers that could significantly diverge from the original selection pressures.
I’m still working to formalize precisely what mathematical structure best captures this notion of slack—whether it’s best understood through the lens of utility boundaries, free energy, or some other framework.