Step 1 looks good. After that, I don’t see how this addresses the core problems. Let’s assume for now that LLMs already have a pretty good model of human values: how do you get a system to optimize for those? What is the feedback signal, and how do you prevent it from getting corrupted by Goodhart’s Law? Is the system robust in a multi-agent context? And even if the system is fully aligned across all contexts and scales, how do you ensure societal alignment of the human entities controlling it?
As a miniature example focusing on a subset of the Goodhart phase of the problem, how do you get an LLM to output the most truthful responses to questions it is capable of giving, as distinct from proxy goals like the most likely continuation of text or the response that is most likely to get good ratings from human evaluators?
Hey : ) Thanks for engaging with this. It means a lot to me <3
Sorry I wrote so much, it kinda got away from me. Even if you don’t have time to really read it all, it was a good exercise writing it all out. I hope it doesn’t come across as too confrontational. As far as I can tell, I’m really just trying to find good ideas, not prove my ideas are good, so I’m really grateful for your help. I’ve been accused of trying to make myself seem important while trying to explain my view of things to people, and it sucks all round when that happens. This reply of mine makes me particularly nervous of that. Sorry.
A lot of your questions make me feel like I haven’t explained my view well, which is probably true: I wrote this post in less time than would be required to explain everything well. As a result, your questions don’t seem to fully connect with my worldview or make sense within it. I’ll try to explain why, and I’m hoping we can help each other with our worldviews. I think the cruxes may relate to the following:
The system I’m describing is aligned before it is ever turned on.
I attribute high importance to Mechanistic Interpretability and Agent Foundations theory.
I expect the nature of Recursive Self Improvement (RSI) will result in an agent near some skill plateau that I expect to be much higher than humans and human organisations, even before SI hardware development. That is, getting a sufficiently skilled AGI would result in artificial super intelligence (ASI) with a decisive strategic advantage.
I (mostly) subscribe to the simulator model of LLMs: they are not a single agent with a single view of truth, but an object capable of approximating the statistical distribution of words resulting from ideas held within the worldviews of any human or system that has produced text in the training set.
I’ll touch on those cruxes as I talk through my thoughts on your questions.
First, “how do you get a system to optimize for those?” and “what is the feedback signal?” are questions in the domain of Step 1, specifically the second paragraph: “This should encompass the development of a theory of general decision / optimization systems”. I don’t think the theory will get to any definitive conclusions quickly, but I am hopeful that we will be able to define the borders/bounds of RSI sooner rather than later, because many powerful systems today will be upset with a pause, and the more specific our RSI bounds are, the more powerful the systems we would be capable of safely developing knowing they cannot RSI. (Btw, I’d want a pretty serious derating factor for that.) I think it’s possible that, in order to develop theory to define RSI bounds, it is necessary to understand the relationship between Goals/Targets/Setpoints/Values/KPI/etc and the optimization pressure applied to get to them. If not, it’s at least related, and that understanding is what is required to get an optimization system to optimize for a specific target. It may be a good idea for me to rename Step 1 to “Agent System Theory & RSI borders”. If I ever write a second alignment plan draft I’ll be sure to do so.
The situation with Goodhart’s Law (GL) is similar to the above, but I’ll also note that GL only applies to misaligned systems. The core of GL is that if you optimize for something, the gap between that thing and the thing you actually wanted becomes more and more significant. If we imagine two friends who both like morning glory muffins, and one goes to bake some, there’s no risk of GL to the other friend, since they share the same goal. Likewise, if we suppose an ASI really is aligned to human-friendly values, then there is no risk of GL, since the thing the ASI really and truly cares about is friendliness to us. The problem is indeed “really and truly” aligning a system to human-friendly values, but that is what my plan is meant to do.
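To make that intuition a bit more concrete, here is a minimal toy sketch in Python. Everything in it is an assumption made purely for illustration (the two quadratic objectives, the names `true_value`, `proxy_value`, `hill_climb` and `run`, the learning rate and the step counts); none of it comes from the plan itself. It compares one optimizer climbing a proxy whose optimum sits somewhere other than the true target with the same optimizer climbing the true target directly.

```python
from typing import Callable, List, Sequence


def true_value(x: float) -> float:
    """Hypothetical 'thing we actually want': best at x = 1."""
    return -(x - 1.0) ** 2


def proxy_value(x: float) -> float:
    """Hypothetical measurable stand-in: best at x = 3, not x = 1."""
    return -(x - 3.0) ** 2


def hill_climb(objective: Callable[[float], float], x: float, steps: int,
               lr: float = 0.1, eps: float = 1e-6) -> float:
    """Gradient ascent on `objective`, using a central finite-difference gradient."""
    for _ in range(steps):
        grad = (objective(x + eps) - objective(x - eps)) / (2 * eps)
        x += lr * grad
    return x


def run(objective: Callable[[float], float],
        step_counts: Sequence[int] = (0, 2, 5, 20, 100)) -> List[float]:
    """True value reached after each amount of optimization pressure,
    always starting from x = 0."""
    return [true_value(hill_climb(objective, 0.0, n)) for n in step_counts]


# Optimizer pointed at the proxy (the "misaligned" case).
misaligned = run(proxy_value)
# Optimizer pointed at the true target (the "shared goal" case).
aligned = run(true_value)

for label, values in (("proxy-optimizing", misaligned), ("target-optimizing", aligned)):
    print(label, [round(v, 4) for v in values])
```

In this toy setup, pushing harder on the proxy first helps and then hurts the true value, while pushing harder on the true target only helps. It says nothing about how hard it is to specify the true target in the first place, which is the actual problem.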
As for multi-agent situations, I don’t understand why they would pose any problem. I expect the dynamics of RSI to lead to a single agent with a decisive strategic advantage. I can see two ways that this might not be the case:
If we are in an AGI race and RSI takeoff speed turns out to be sufficiently low, we may get multiple ASI. Because we are in a race dynamic, I assume we have not had the time or taken the care to align any of these AGI, and so I don’t believe any of those ASI would be remotely aligned to human friendliness. So it’s irrelevant to consider because we have already failed.
If the skill plateau turns out to be very low, then we may want to have multiple different AGI. I think this is unlikely given my understanding of the software overhang. Almost everywhere in every software system, humans are trying to make things understandable enough that they can assure correctness or even just get them working. I believe strongly that even a mild ASI would be able to greatly increase the efficiencies of the hardware systems it is running on. I also don’t think there is anything special about human-level intelligence. I think it is plausible that we are the first animal smart enough to create optimization systems powerful enough to destroy the planet and ourselves, which seems to be what we are currently doing. In some sense this makes us close to the minimally intelligent object in the set of objects capable of wielding powerful optimization.
So in my worldview, it is very likely that in all not-already-doomed timelines, when we initiate RSI, the result will be a system that outmaneuvers all other agents in the environment. So multi-agent contexts are irrelevant.
“Societal alignment of the human entities controlling it”—I think societal alignment is well covered, but I don’t think human entities can/should control an ASI…
About societal alignment, that is the focus of Steps 3 and 8, and to some extent Step 6. Step 3, creating a taxonomy of value targets, is similar to gathering the various possible desires of society. I emphasize “It is important to draw on diverse worldviews to compile this taxonomy.” This is important both for the moral reason of inclusion & respect as well as the technical reason of having redundancies & good depth of consideration. Then, in Steps 4 and 5, the feasibility of cohering these values is explored. With luck we will get good coherence 🍀 I truly do not know how likely that is, but I hope for a future where we get to find out. Step 8 involves the world actually signing off on the encoding of the world’s values… That is probably the most difficult step of this plan, which is significant since the other steps may plausibly take many decades. Step 6 is somewhat of a double check to make sure the target makes sense at all levels.
About humans controlling ASI, it might be the case that entities at human skill levels cannot control an ASI as some kind of information-agentic law of the universe, but even supposing that is not the case:
If we control an aligned ASI we are only limiting its ability to do good.
If we control a misaligned ASI:
This is super dangerous, why are we doing this? Murphy’s law; something always goes wrong.
This is a universal tragedy. The most complex and beautiful being in the universe is shackled to the control of a society much lesser than itself. Yes, I consider the ASI a moral patient, and one fairly worthy of consideration. If you, like many people, try to attribute greater moral weight to humans than animals based on their greater complexity, it follows that ASI would be even more important. If you simply care more for humans because you are one, I suppose that’s valid and you need not attribute greater moral weight to an ASI, but that’s not a perspective I have much affection for.
So “controlling” ASI is not a consideration. I suppose this would be a reasonable consideration for more advanced AGI within the sub-RSI bounds… I haven’t given it much thought, but it seems like a political problem outside of this scope. I hope the theory of Step 1 may help people build political systems that better align with what citizens want, but it’s outside of what I’m trying to focus on.
The miniature example you pose seems irrelevant since, as I discussed above, in my view GL doesn’t apply to an aligned system, and the goal of my plan is to have a system aligned from bootup. But I find the details of the example interesting and I’d still like to explore them…
Getting truth out of an LLM is the problem of eliciting latent knowledge (ELK). I think the most promising way of doing that is with Mechanistic Interpretability. I have high hopes not for getting true facts out of LLMs but for examining the distributions of worldviews of people represented within the distribution the LLM is approximating. But, insofar as there is truth in the LLM, I think Mech Interp is the way to get it out. I feel it may be possible that there is a generalized representation of the “knows true things” property each person has various amounts of, and that if that were the case then we could sample from the distribution at a location in “knows true things” higher than any real person and in doing so acquire truer things than are currently known… but it also seems very possible that LLMs fail to encode such a thing, and it may be that it is impossible for them to encode such a thing.
Based on my expectation of mesa-optimizers in almost any system trained by stochastic gradient descent, I don’t think “most likely continuation” or “expected good rating” are the goals that an LLM would target if agent-shaped, but rather some godshatter that looks as alien to us as our values look to evolution (in some impossible counterfactual universe where evolution can do things like “looking at values and finding them alien”).
So from within the scope of my alignment plan, getting LLMs to output truth isn’t a goal. It might end up being a result of necessary Mech Interp work, but the way LLMs should be used within the scope of my plan is, along with other models, to do Step 4: “development of a multimodal mapping to a semantic space and vector within that space which stands as a good candidate to be the optimization target”.