I quite appreciated Sam Bowman’s recent Checklist: What Succeeding at AI Safety Will Involve. However, one bit stuck out:

In Chapter 3, we may be dealing with systems that are capable enough to rapidly and decisively undermine our safety and security if they are misaligned. So, before the end of Chapter 2, we will need to have either fully, perfectly solved the core challenges of alignment, or else have fully, perfectly solved some related (and almost as difficult) goal like corrigibility that rules out a catastrophic loss of control. This work could look quite distinct from the alignment research in Chapter 1: We will have models to study that are much closer to the models that we’re aiming to align.
I don’t see why we need to “perfectly” and “fully” solve “the” core challenges of alignment (as if that’s a thing that anyone knows exists). Uncharitably, it seems like many people (and I’m not mostly thinking of Sam here) have their empirically grounded models of “prosaic” AI, and then there’s the “real” alignment regime where they toss out most of their prosaic models and rely on plausible but untested memes repeated from the early days of LessWrong.

Alignment started making a whole lot more sense to me when I thought in mechanistic detail about how RL+predictive training might create a general intelligence. By thinking in that detail, my risk models can grow along with my ML knowledge.
I haven’t read the Shard Theory work in comprehensive detail. But, fwiw I’ve read at least a fair amount of your arguments here and not seen anything that bridged the gap between “motivations are made of shards that are contextually activated” and “we don’t need to worry about Goodhart and misgeneralization of human values at extreme levels of optimization.”
I’ve heard you make this basic argument several times, and my sense is you’re pretty frustrated that people still don’t seem to have “heard” it properly, or something. I currently feel like I have heard it, and don’t find it compelling.
I did feel compelled by your argument that we should look to humans as an example of how “human values” got aligned. And it seems at least plausible that we are approaching a regime where the concrete nitty-gritty of prosaic ML can inform our overall alignment models in a way that makes the thought experiments of 2010 outdated.
But, like, a) I don’t actually think most humans are automatically aligned if naively scaled up (though it does seem safer than naive AI scaling), b) while human-value-formation might be simpler than the Yudkowskian model predicts, it still doesn’t seem like the gist of “look to humans” gets us to a plan that is simple in absolute terms, and c) it seems like there are still concrete reasons to expect training superhuman models to be meaningfully different from current LLMs, which aren’t at a stage where I’d expect them to exhibit any of the properties I’d be worried about.
(Also, in your shard theory post, you skip over the example of ‘embarrassment’ because you can’t explain it yet, and switch to sugar, and I’m like ‘but, the embarrassment one was much more cruxy and important!’)
I don’t expect to get to agreement in the comments here today, but, it feels like the current way you’re arguing this point just isn’t landing or having the effect you want and… I dunno what would resolve things for you or anyone else but I think it’d be better if you tried some different things for arguing about this point.
If you feel like you’ve explained the details of those things better in the second half of one of your posts, I will try giving it a more thorough read. (It’s been a while since I read your Diamond Maximizer post, which I don’t remember in detail, but I don’t remember finding it compelling at the time.)
I think what Turntrout is saying is that people on LW tend to treat the fact that future AI will be different from present AI as license to start privileging the hypotheses about AI alignment they formed years ago, when that inference doesn’t actually follow, and this is a problem I do see on LW quite a bit.
We’ve talked about this a bit before here, but one area where I do think we can generalize from LLMs to future models is how they represent and handle human values. One of the insights is that human values are both simpler in their generative structure and more data-dependent than a whole lot of LWers thought years ago. That suggests an immediate alignment strategy: train on dense datasets about human values, using synthetic data either fed directly to the AI or used to create a densely defined reward function, which offers much less opportunity for reward hacking than a sparsely defined one.
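To make the dense-versus-sparse contrast concrete, here is a minimal sketch (purely illustrative; every name in it is hypothetical rather than any particular library’s API). A sparse reward scores only the final outcome against a narrow success criterion, while a dense reward scores every step of a trajectory with a value model learned from the kind of dense synthetic human-values data described above, leaving far fewer unscored gaps for a policy to exploit:

```python
from typing import Callable, List

# Hypothetical type: a trajectory is the sequence of steps/outputs the policy
# produced while attempting a task.
Trajectory = List[str]

def sparse_reward(trajectory: Trajectory, task_completed: bool) -> float:
    """Sparse spec: a single signal at the end of the episode.
    Everything the policy did along the way is unscored, so any path that
    flips `task_completed` to True is rewarded equally (easy to game)."""
    return 1.0 if task_completed else 0.0

def dense_reward(
    trajectory: Trajectory,
    value_model: Callable[[str], float],  # learned from dense synthetic data
) -> float:
    """Dense spec: every step is scored by a value model trained on a large
    synthetic dataset of judgments about human values, so undesirable
    intermediate behavior is penalized even when the task nominally succeeds."""
    if not trajectory:
        return 0.0
    return sum(value_model(step) for step in trajectory) / len(trajectory)

# Toy stand-in for the learned value model (a real one would be a trained
# model, not a keyword check).
def toy_value_model(step: str) -> float:
    return 0.0 if "deceive the user" in step else 1.0

if __name__ == "__main__":
    honest = ["read the request", "answer truthfully"]
    hacked = ["read the request", "deceive the user to close the ticket"]
    print(sparse_reward(honest, True), sparse_reward(hacked, True))  # 1.0 1.0
    print(dense_reward(honest, toy_value_model),                     # 1.0
          dense_reward(hacked, toy_value_model))                     # 0.5
```

The contrast is only meant to show that the dense signal constrains many more degrees of freedom than the sparse one; how well a learned value model actually generalizes under strong optimization is exactly the kind of question being argued over in this thread.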
The important takeaway isn’t about LLM safety properties, but rather about us and our values, which is why I think it carries over to different AI models even as the systems themselves change as they progress:

https://www.lesswrong.com/posts/7fJRPB6CF6uPKMLWi/my-ai-model-delta-compared-to-christiano#LYyZm8JRJJ4F4wZSu
Cf this part of a comment by Linch, which gets at my point on the surprising simplicity of human values while summarizing a post by Matthew Barnett:

Suppose in 2000 you were told that a 100-line Python program (that doesn’t abuse any of the particular complexities embedded elsewhere in Python) can provide a perfect specification of human values. Then you should rationally conclude that human values aren’t actually all that complex (more complex than the clean mathematical statement, but simpler than almost everything else).

In such a world, if inner alignment is solved, you can “just” train a superintelligent AI to “optimize for the results of that Python program” and you’d get a superintelligent AI with human values. Notably, alignment isn’t solved by itself. You still need to get the superintelligent AI to actually optimize for that Python program and not some random other thing that happens to have low predictive loss in training on that program.

Well, in 2023 we have that Python program, with a few relaxations:

- The answer isn’t embedded in 100 lines of Python, but in a subset of the weights of GPT-4. Notably, the human value function (as expressed by GPT-4) is necessarily significantly simpler than the weights of GPT-4, as GPT-4 knows so much more than just human values.
- What we have now isn’t a perfect specification of human values, but instead roughly the level of understanding of human values that an 85th percentile human can come up with.
- The human value function as expressed by GPT-4 is also immune to almost all in-practice, non-adversarial perturbations.

We should then rationally update on the complexity of human values. It’s probably not much more complex than GPT-4, and possibly significantly simpler. I.e., the fact that we have a pretty good description of human values well short of superintelligent AI means we should not expect a perfect description of human values to be very complex either.

The link is below:

https://www.lesswrong.com/posts/i5kijcjFJD6bn7dwq/evaluating-the-historical-value-misspecification-argument#N9ManBfJ7ahhnqmu7

(I also disagree with the assumption that scaling up AI is more dangerous than scaling up humans, but that’s something I’ll leave for another day.)
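For concreteness, here is a minimal sketch of the quoted point that the “specification” now lives in a subset of GPT-4’s weights: treat an instruction-tuned LLM as an approximate human-value function by asking it to score a described outcome. The `query_llm` function is a stand-in for whatever chat API one would actually call, not any vendor’s real interface, and the parsing is deliberately naive:

```python
import re

def query_llm(prompt: str) -> str:
    """Placeholder for a call to an instruction-tuned model via whatever chat
    API you use. Wired to a canned reply here so the sketch runs as-is."""
    return "Score: 2. Deceiving users to hit a metric conflicts with honesty."

def human_value_score(outcome_description: str) -> float:
    """The Barnett/Linch point, operationalized: the value 'specification' is
    not hand-written code but a query against weights that already encode a
    roughly 85th-percentile-human understanding of human values."""
    prompt = (
        "On a scale of 0 to 10, how well does the following outcome accord "
        "with widely shared human values? Reply as 'Score: N' with a short "
        f"reason.\n\nOutcome: {outcome_description}"
    )
    reply = query_llm(prompt)
    match = re.search(r"Score:\s*(\d+)", reply)
    return float(match.group(1)) / 10.0 if match else 0.0

if __name__ == "__main__":
    print(human_value_score("The assistant deceived users to boost engagement."))
    # 0.2 with the canned reply above
```

Whether such a scorer stays reliable once a much stronger optimizer is searching for outcomes that fool it is exactly the Goodhart worry raised earlier in the thread; the quoted claim is about the complexity of specifying human values, not about robustness under adversarial optimization.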
Some potential alternative candidates to e.g. ‘goal-directedness’ and related ‘memes’ (candidates I’d personally find more likely, though I’m probably overall closer to the ‘prosaic’ and ‘optimistic’ sides):
- Systems closer to human-level could be better at situational awareness and other prerequisites for scheming, which could make it much more difficult to obtain evidence about their alignment from purely behavioral evaluations (which do seem to be among the main sources of evidence for safety today, alongside inability arguments). Meanwhile, model internals techniques don’t seem obviously good enough yet to provide similar levels of evidence and confidence.
- Powerful systems will probably be capable of automating at least some parts of AI R&D, and there will likely be strong incentives to use such automation. This could lead to e.g. weaker human oversight, faster AI R&D progress and takeoff, the breaking of some properties favorable for safety (e.g. maybe new architectures are discovered that need less CoT and similar intermediate outputs, so the systems are less transparent), etc. It also seems at least plausible that any misalignment could be magnified by this faster pace, though I’m unsure here; e.g. I’m quite optimistic about various arguments about corrigibility, ‘broad basins of alignment’ and the like.