I sometimes think of alignment as having two barriers:
Obtaining levers that can be used to design and shape an AGI in development.
Developing theory that predicts the effect of your design choices.
My current understanding of your agenda, in my own words:
You’re trying to create a low-capability AI paradigm that has way more levers. This paradigm centers on building useful systems by patching together LLM calls. You’re collecting a set of useful tactics for doing this patching. You can rely on tactics in a similar way to how we rely on programming language features, because they are small and well-tested-ish. (1 & 2) (I sketch what I’m picturing below.)
As new tactics are developed, you’re hoping that expertise and robust theories develop around building systems this way. (3)
This by itself doesn’t scale to hard problems, so you’re trying to develop methods for learning and tracking knowledge/facts that interface with the rest of the system while remaining legible. (4)
Maybe with some additional tools, we build a relatively-legible emulation of human thinking on top of this paradigm. (5)
Have I understood this correctly?
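To check that I’m picturing this right, here’s roughly what I mean by a “tactic”, as a hypothetical Python sketch (the llm stub and the function names are mine, not from the post): each tactic is a small, separately testable wrapper around one bounded LLM call, and the glue between tactics is ordinary code you can read.

```python
from typing import Callable

# Hypothetical sketch: a "tactic" wraps one bounded LLM call behind a plain
# string -> string function, so it can be unit-tested against a stub model.
Llm = Callable[[str], str]

def classify_tactic(llm: Llm, text: str, labels: list[str]) -> str:
    """Ask the model for exactly one of the allowed labels; reject anything else."""
    prompt = f"Pick exactly one label from {labels} for this text:\n{text}\nLabel:"
    answer = llm(prompt).strip()
    if answer not in labels:  # the legible, classical check around the neural call
        raise ValueError(f"model returned {answer!r}, expected one of {labels}")
    return answer

def summarize_then_classify(llm: Llm, document: str, labels: list[str]) -> str:
    """Compose two tactics; the composition itself is ordinary, inspectable code."""
    summary = llm(f"Summarize in one sentence:\n{document}").strip()
    return classify_tactic(llm, summary, labels)

if __name__ == "__main__":
    # Exercise the tactics with a stub model, the way a unit test would.
    stub = lambda prompt: "complaint" if "Label:" in prompt else "It arrived broken."
    print(summarize_then_classify(stub, "The product arrived broken.", ["complaint", "praise"]))
```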
I feel like the alignment section of this is missing. Is the hope that better legibility and experience allow us to solve the alignment problems that we expect at this point?
Maybe it’d be good to name some speculative tools/theory that you hope will have been developed for shaping CoEms, then say how they would help with some of:
Unexpected edge cases in value specification
Goal stability across ontology shifts
Reflective stability of goals
Optimization daemons or simpler self-reinforcing biases
Maintaining interruptibility against instrumental convergence
Most alignment research skips straight to trying to resolve issues like these, at least in principle, and then often backs off to develop the relevant theory. I can see why you might want to do the levers part first, and have theory develop along with experience building things. But it’s risky to do the hard part last.
but because the same solutions that will make AI systems beneficial will also make them safer
This is often not true, and I don’t think your paradigm makes it true. E.g. often we lose legibility to increase capability, and that is plausibly also true during AGI development in the CoEm paradigm.
In practice, sadly, developing a true ELM is currently too expensive for us to pursue
Expensive why? Seems like the bottleneck here is theoretical understanding.
I am most confident in phases 1-3 of this agenda, and I think you have overall a pretty good rephrasing of 1-5, thanks! One note is that I don’t think of “LLM calls” as fundamental; I think of LLMs as a stand-in for “banks of patterns” or “piles of shards of cognition.” The exact shape of this can vary; LLMs are just our current most common shape of “cognition engine”, but I can think of many other, potentially better, shapes this “neural primitive/co-processor” could take.
I think there is some deep, as-yet-unformalized concept of computer science that differentiates what are intuitively “cognitive”/”neural” type problems from “classical”/”code” type problems. Why can neural networks easily recognize dogs, while doing it in regular code is hell? How can one predict ahead of time whether a given task can be solved with a given set of programming tools or neural network components? Some kind of vastly more advanced form of Algorithmic Information Theory, one that can take as input your programming tools and libraries, plus a description of the problem you are trying to solve, and output how hard it is going to be (or what “engineering complexity class” it would belong to, whatever that means). I think this is a vast, unsolved question of theoretical computer science, one that I don’t expect we will solve any sooner than we are going to solve P vs NP.
So, in the absence of such principled understanding, we need to find the “engineering approximation equivalent” of this, which involves using as much code as we can and bounding the neural components as much as we can, and then developing good practical engineering around this paradigm.
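As a purely illustrative toy of what “as much code as we can, with the neural components bounded” can look like (the names here are made up for this example, not an actual system of ours): classical code enumerates and validates every option, and the neural piece is only ever allowed to select among them.

```python
from typing import Callable

Neural = Callable[[str], str]  # stand-in for any "bank of patterns" / co-processor

def pick_bounded(neural: Neural, task: str, candidates: list[str]) -> str:
    """Let the neural component choose among pre-vetted candidates, nothing more."""
    numbered = "\n".join(f"{i}: {c}" for i, c in enumerate(candidates))
    reply = neural(f"Task: {task}\nOptions:\n{numbered}\nAnswer with one number:")
    try:
        index = int(reply.strip())
    except ValueError:
        index = 0  # classical, fully predictable fallback
    # Whatever the model outputs, only a pre-vetted candidate can be returned.
    return candidates[index] if 0 <= index < len(candidates) else candidates[0]

if __name__ == "__main__":
    stub = lambda prompt: "1"  # stand-in model for the example
    print(pick_bounded(stub, "respond to an angry customer",
                       ["do nothing", "send an apology", "escalate to a human"]))
```

The practical engineering effort then goes into growing the library of such bounded patterns and the testing practice around them, rather than into the neural component itself.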
Maybe it’d be good to name some speculative tools/theory that you hope will have been developed for shaping CoEms, then say how they would help with some of:
There are two main ways in which I see things differently in the CoEm frame:
First, the hope isn’t so much that CoEm “solves” these problems, but makes them irrelevant, because it makes it possible to not slip into the dangerous/unpredictable capabilities regime unexpectedly. If you can’t ensure your system won’t do something funky, you can simply choose not to build it, and instead decide to build something you can ensure proper behavior of. Then you can iterate, unlike in the current “pump as much LLM/RL juice as possible as fast as possible” paradigm.
In other words, CoEm makes it easier to distinguish between capabilities failures and alignment failures.
Most alignment research skips straight to trying to resolve issues like these, at least in principle, and then often backs off to develop the relevant theory. I can see why you might want to do the levers part first, and have theory develop along with experience building things. But it’s risky to do the hard part last.
Secondly, more speculatively, I expect these problems to dissolve under better engineering and understanding. Here I am trying to point at something like “physicalism” or “gears level models.” If you have gears level models, a lot of the questions you might ask in a non-gears-level model stop making sense/being relevant, and you find new, more fundamental questions and tradeoffs.
I think ontologies such as Agents/Goals are artifacts of poor understanding of deeper mechanics. If you can’t understand the inner mechanics of cell biology, then maybe psychology is the best you can do to predict a human. But if you can understand cell biology and construct a biological being from scratch, I think you don’t need the Agent framing, and it would be actively confusing to insist it is ontologically primitive somehow and must be “addressed” in your final description of the system you are engineering. These kinds of abstract/functionalist/teleological models might be a good source of inspiration for messing around, but this is not the shape that the true questions will have.
“Instrumental convergence” dissolves into questions of predictability, choices of resource allocation and aesthetic/ethical stances on moral patienthood/universal rights. Those problems aren’t easy, but they are different and more “fundamental”, more part of the territory than of the map.
Similarly, “Reflective stability of goals” is just a special case of predicting what your system does. It’s not a fundamental property that AGIs have and other software doesn’t.
The whole CoEm family of ideas is pointing in this direction, encouraging the uncovering of more fundamental, practical, grounded, gears level models, by means of iterative construction. I think we currently do not have good gears level models of lots of the important questions of AI/cognition/alignment, and I think the way to get there is by treating it as a software/physicalist/engineering problem, not presupposing an already higher level agentic/psychological/functionalist framing. (It’s like the epistemological equivalent of the AI Effect, but for good, lol.)
I think that picking a hard problem before you know whether that “hard problem” is real or not is exactly what leads to confusions like the “hard problem of consciousness”, followed by zero actual progress on problems that matter. I don’t actually think we know what the true “hard problems” are to a level of deconfusion that we can just tackle them directly and backchain. Backchaining from a confused or wrong goal is one of the best ways to waste an entire career worth of research.
I’m not saying this is guaranteed to solve all these problems, or that I am close to having solved all these problems, but this agenda is the type of thing I would do if I wanted to make iterative research progress in that direction.
This is often not true, and I don’t think your paradigm makes it true. E.g. often we lose legibility to increase capability, and that is plausibly also true during AGI development in the CoEm paradigm.
It’s kinda trivially true in that the point of the agenda is to get to legibility, and if you sacrifice on legibility/constructibility, you are no longer following the paradigm, but I realize that is not an interesting statement. Ultimately, this is a governance problem, not a technical problem. The choice to pursue illegible capabilities is a political one.
Expensive why? Seems like the bottleneck here is theoretical understanding.
Literally compute and man-power. I can’t afford the kind of cluster needed to even begin a pretraining research agenda, or to hire a new research team to work on this. I am less bottlenecked on the theoretical side atm, because I need to run into a lot of bottlenecks from actual grounded experiments first.
I think we currently do not have good gears level models of lots of the important questions of AI/cognition/alignment, and I think the way to get there is by treating it as a software/physicalist/engineering problem, not presupposing an already higher level agentic/psychological/functionalist framing.
Here are two ways that a high-level model can be wrong:
It isn’t detailed enough, but once you learn the detail it adds up to basically the same picture. E.g. Newtonian physics, ideal gas laws. When you get a more detailed model, you learn more about which edge-cases will break it. But the model basically still works, and is valuable for working out the more detailed model.
It’s built out of confused concepts. E.g. free will, consciousness (probably), many ways of thinking about personal identity, four humors model. We’re basically better off without this kind of model and should start from scratch.
It sounds like you’re saying high-level agency-as-outcome-directed is wrong in the second way? If so, I disagree, it looks much more like the first way. I don’t think I understand your beliefs well enough to argue about this, maybe there’s something I should read?
I have a discomfort that I want to try to gesture at:
Are you ultimately wanting to build a piece of software that solves a problem so difficult that it needs to modify itself? My impression from the post is that you are thinking about this level of capability in a distant way, and mostly focusing on much earlier and easier regimes. I think it’s probably very easy to work on legible low-level capabilities without making any progress on the regime that matters.
To me it looks important for researchers to have this ultimate goal constantly in their mind, because there are many pathways off-track. Does it look different to you?
Ultimately, this is a governance problem, not a technical problem. The choice to pursue illegible capabilities is a political one.
I think this is a bad place to rely on governance, given the fuzziness of this boundary and the huge incentive toward capability over legibility. Am I right in thinking that you’re making a large-ish gamble here on the way the tech tree shakes out (such that it’s easy to see a legible-illegible boundary, and the legible approaches are competitive-ish) and also the way governance shakes out (such that governments decide that e.g. assigning detailed blame for failures is extremely important and worth delaying capabilities)?
I’m glad you’re doing ambitious things, and I’m generally a fan of trying to understand problems from scratch in the hope that they dissolve or become easier to solve.
Literally compute and man-power. I can’t afford the kind of cluster needed to even begin a pretraining research agenda, or to hire a new research team to work on this. I am less bottlenecked on the theoretical side atm, because I need to run into a lot of bottlenecks from actual grounded experiments first.
Why would this be a project that requires large-scale experiments? It looks like something that a random PhD student with two GPUs could maybe make progress on. Might even be a good problem to make a prize for?