One of my favorite things about you, John, is that you are excellent at prompting me to direct my attention towards important questions which lead to promising insights. Thanks for that!
I answered your questions, originally in my private notes, but partway through I decided to post them as a comment.
Imagine that your field achieved perfection—the ultimate theory, perfect understanding, building The Thing.
What has been achieved in the idealized version of the field, which has not yet been achieved today? What are the main barriers between here and there?
Detailed understanding of what it means for one agent to be aligned with another agent, or a group of agents.
We can easily check the functional properties of a causal process and argue that it satisfies theorems saying that, with high probability (WHP), it veers towards highly desirable states.
Like, I could point out the ways in which DRL fails at a glance, without appealing to instrumental convergence in particular?
Or maybe that’s part of the theory?
These theorems are intuitively obviously correct and correspond to original-intuitive-reality.
They are so correct that it’s easy to persuade people to use the aligned approach.
We understand what agents are, and how people fit into that picture, and the theory retrodicts all past problems with governments, with corporations, with principal-agent relationships.
We know how to take any reasonable training process and make it an aligned training process, with minimal alignment tax (<5%)
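(As a rough operationalization of the tax; this is my gloss, not a canonical definition:)

$$\text{tax} \;=\; 1 - \frac{\text{performance}(\text{aligned training process})}{\text{performance}(\text{unaligned baseline})} \;<\; 0.05$$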
Barriers:
We don’t know what agents are.
We don’t know what alignment means
We don’t know how to prove the right kinds of theorems
We don’t know if our concepts are even on the right track
They probably aren’t, except insofar as they spur the right language. It feels more like “how do I relate KL div and optimality probability” rather than “let’s prove theorems about retargetability”
Often, in hindsight, a field turns out to have been bottlenecked on the development of some new measurement method, ranging from physical devices like the thermometer to abstract ideas like Shannon’s entropy and information channel capacity.
In what places does it look like your field is bottlenecked on the ability to usefully measure something? What are the main barriers to usefully measuring those things?
Bottlenecks
Abstraction
Uninterpretable
Not sure where to draw “category boundaries”
Alignment
Don’t know what alignment really is
Or how to measure ground truth
Needs a vocabulary of concepts we don't have yet
Power-seeking
Unclear what actually gets trained, and how to measure power-seeking with respect to a distribution over goals
Would need the agent's current beliefs to use the current formalism (sketched at the end of this list)
Also, the current formalism doesn't capture game-theoretic aspects like logical blackmail
Intent seems more important
But this is bottlenecked on “what is going on in the agent’s mind”
Interpretability
I’m not sure. Networks are big and messy.
Capability
Actually I bet this is natural in terms of the right alignment language
Compute efficiency in terms of ability to bring about cognitively important outcomes
Seems strictly harder than “capability”
Time remaining until TAI
Uncertainty about how AI works, how minds work, and what the weighted-edge distance is on the lattice of AI discoveries
Barriers
Conceptual roadblocks
But what?
(Filled in above)
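(For reference, the "current formalism" flagged under power-seeking above: the average-optimal-value notion of POWER, written from memory and modulo details:)

$$\mathrm{POWER}_{\mathcal{D}}(s, \gamma) \;=\; \frac{1-\gamma}{\gamma}\, \mathbb{E}_{R \sim \mathcal{D}}\!\left[ V^{*}_{R}(s, \gamma) - R(s) \right]$$

Hence the bottleneck: it's defined relative to a known environment and a distribution $\mathcal{D}$ over reward functions, not relative to what a trained agent actually believes or wants.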
What are the places where your field is flailing around in the dark, trying desperate ideas and getting unhelpful results one after another? What are the places where it feels like the problem is formulated in the wrong language, and a shift to another frame might be required to ask the right question or state the right hypothesis?
Flailing:
IDA
ELK
Everything theoretical feels formulated wrong, except maybe logical induction / FFS / John's work
This is important!
(Also I wouldn't be surprised if I'd say Vanessa's work is not flailing, if I could understand it)
Retargetability → instrumental convergence seems like an important piece, but not the whole thing; part of it is phrased correctly
AUP (attainable utility preservation) was flailing
Did I get impact wrong? Or is reward maximization wrong?
I think I got impact right philosophically, but not the structure of how to get one agent to properly care about impact on other agents.
I just found a good trick (penalize the agent for impact on other goals it could have pursued; sketched below), which works really well in a range of cognitively available situations (physical proximity) but which breaks down under tons of optimization pressure
And the “don’t gain power for your own goal” seems like it should be specifiable and non-hacky, but I don’t actually see how to do it right.
But note that getting impact right for-real wouldn’t save the world AFAICT
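(The trick, for concreteness; this is the AUP penalty, written from memory and modulo scaling details:)

$$R_{\text{AUP}}(s,a) \;=\; R(s,a) \;-\; \frac{\lambda}{N} \sum_{i=1}^{N} \Big| Q^{*}_{R_i}(s,a) \;-\; Q^{*}_{R_i}(s, \varnothing) \Big|,$$

penalizing shifts in the agent's ability to pursue each auxiliary goal $R_i$, relative to taking the no-op $\varnothing$. This structure works under the optimization pressure we tested; I don't expect it to survive extreme pressure.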
Basically everything else
What happened when I shifted to the retargetability frame?
I don't think I did that until recently, actually; the original post was too anchored on instrumental convergence over outcome sets, missing the elegant functional statement.
And my shift to this frame still feels incomplete.
Corrigibility still feels like it should work in the right language and grounding
Sometimes, we have a few different models, each of which works really well in different places. Maybe it feels like there should be some model which unifies them all, which could neatly account for all these phenomena at once—like the unification of electricity, magnetism and optics in the 19th century.
Are there different models in your field which feel like they point to a not-yet-known unified model?
I guess different decision theories? Not super familiar
Not coming up with as many thoughts here, because I feel like our “partial models” are already contradicted and falsified on their putative domains of applicability, so what good would a unification do? More concisely stated wrongness?
One of the main ways we notice (usually implicit) false assumptions in our models is when they come into conflict with some other results, patterns or constraints. This may look like multiple models which cannot all be true simultaneously, or it may look like one model which looks like it cannot be true at all yet nonetheless keeps matching reality quite well. This is a hint to reexamine the assumptions under which the models are supposedly incompatible/impossible, and especially look for any hidden assumptions in that impossibility argument.
Are there places in your field where a few models look incompatible, or one model looks impossible, yet nonetheless the models match reality quite well?
Tempted to say "no", because the last phrase in the last sentence doesn't seem true here.
Here was one. The instrumental convergence theorems required a rather precise environmental symmetry, which seemed weird. But now I have a new theory which relates abstract functional properties of how events come about, to those events coming out similarly for most initial conditions. And that doesn’t have anything to do with environmental / outcome-level symmetries. So at first the theory was right in its domain of applicability, but the domain seemed absurdly narrow on a few dimensions.
It was narrow not because I missed how to prove sufficiently broad theorems about MDPs, but because I was focusing on the wrong details and missing the broader concept underlying everything I'd observed.
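(Loosely, the functional statement: let $f(X \mid \theta)$ be the probability that the decision-making procedure with parameter $\theta$ realizes an outcome in the set $X$. Suppose there are $n$ parameter permutations $\phi_1, \dots, \phi_n$ which each retarget "prefers $A$" into disjoint regions of "prefers $B$". Then, stated from memory:)

$$\Pr_{\theta \sim \mathcal{D}}\big[f(B \mid \theta) > f(A \mid \theta)\big] \;\ge\; n \cdot \Pr_{\theta \sim \mathcal{D}}\big[f(A \mid \theta) > f(B \mid \theta)\big]$$

for every permutation-invariant distribution $\mathcal{D}$ over parameters. The conditions are on how $f$ responds to retargeting, not on symmetries of the environment's outcome sets.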
I guess the impossibility of value learning seems correct but spiritually inapplicable to the problem we want to solve, but I don’t quite know how to articulate that.
A few months back I wrote about how corrigibility is often impossible under reward maximization. Reward maximization seems pretty useful for motivating agents, but it's so, so broken for nontrivial kinds of motivation.
The space of possible physical laws or theorems or principles is exponentially vast. Sometimes, the hard part is to figure out what the relevant factors are at all. For instance, to figure out how to reproducibly culture a certain type of cell, a biologist might need to provide a few specific signal molecules, a physical medium with the right elasticity or density, a particular temperature range, and/or some other factors which nobody even thought to test yet.
Are there places in your field where nobody even knows what key factors must be controlled for some important outcome to robustly occur?
In a robotics task, how would we ensure test-time agents did anywhere between 2 and 10 jumping jacks in an episode?
What factors would you control there? We don't know how to "target" along these dimensions; at least, it would take more effort than I think it should (naive spec sketched below)
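(To make the gap concrete, here's the naive approach; `count_jumping_jacks` is a hypothetical helper, a pose-sequence classifier we don't know how to make robust:)

```python
from typing import Sequence

def count_jumping_jacks(trajectory: Sequence) -> int:
    """Hypothetical pose-sequence classifier. We don't know how to build a
    version that an adversarially-optimized policy can't fool."""
    raise NotImplementedError

def episode_return(trajectory: Sequence) -> float:
    """Naive spec: return 1 iff the episode contains 2 to 10 jumping jacks."""
    n = count_jumping_jacks(trajectory)
    return 1.0 if 2 <= n <= 10 else 0.0
```

Writing the spec down is the easy part; the open question is which training factors to control so the learned policy robustly satisfies it at test time, instead of whatever proxy the optimizer finds.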
Are there places in your field where some concept seems very central to understanding, but nobody knows its True Name yet?
Corrigibility
Corrigibility
Corrigibility
Alignment
I remember thinking there was a concept like this related to logical time, but I forget what it was
At the social level, what are the barriers to solving the main problems in the previous two questions? Why aren’t they already solved? Why isn’t progress being made, or made faster?
Bad incentives — I think people working on e.g. value learning should not do that
I think the marginal return from more researcher hours on deconfusion outweighs the return from empirical wisdom in applying known-broken paradigms—gut instinct
People don't know the feel of a True Name-having object, or forget how important the discovery of the name is.
I do this sometimes.
Or it’s really hard
Or we need more kinds of minds looking at this problem
But I don’t know what this means
Are there places where your field as a whole, or you personally, pursue things which won’t really help with the main problems (but might kind of “look like” they address the problems)?
Field as a whole
Value learning seems doomed; why are people still working on it?
There are non-core researchers working in ways which don’t make sense to me, but I don’t remember who they are or what they were working on
They’re more field-adjacent
I used to be more excited about IDA-style insights but now I feel queasy about skipping over the hard parts of a problem without really getting insights about how alignment works
This is a lesson which I took too long to learn, where I was too tolerant of finding "clever" ways to box my uncertainty. There's a Sequences article about this, but I don't want to go find it right now.
What kind of person can I become who would notice this error on my own, before making it, before hearing that this is a failure mode?
Anyways.
Me
I think impact measurement is doomed at least in the current paradigm
In the hindsight of a perfected field, I think impact regularization would be a thing you could do robustly, though not for a decisive act
I’m basically done working on impact measurement though
I'm finishing up my PhD right now, but I think I'm doing pretty well on this axis now. I used to have a bad case of pica.
Pick someone you know, or a few people, who are smart and have good judgment. What would their answers to these questions be?
John, I think you would not strongly disagree with most anything I said, but I feel like you would say that corrigibility isn’t as pragmatically important to understand. Or, you might say that True-Name-corrigibility is actually downstream of the True-Name-alignment concepts we need, and it’s the epiphenomenon. I don’t know. This prediction is uncertain and felt more like I queried my John model to give a mental speech which is syntactically similar to a real-John-speech, rather than my best prediction of what you would say.