I haven’t read the Shard Theory work in comprehensive detail. But, fwiw I’ve read at least a fair amount of your arguments here and not seen anything that bridged the gap between “motivations are made of shards that are contextually activated” and “we don’t need to worry about Goodhart and misgeneralization of human values at extreme levels of optimization.”
I’ve heard you make this basic argument several times, and my sense is you’re pretty frustrated that people still don’t seem to have “heard” it properly, or something. I currently feel like I have heard it, and don’t find it compelling.
I did feel compelled by your argument that we should look to humans as an example of how “human values” got aligned. And it seems at least plausible that we are approaching a regime where the concrete nitty-gritty of prosaic ML can inform our overall alignment models in a way that makes the thought experiments of 2010 outdated.
But, like: a) I don’t actually think most humans are automatically aligned if naively scaled up (though it does seem safer than naive AI scaling); b) while human-value-formation might be simpler than the Yudkowskian model predicts, it still doesn’t seem like the gist of “look to humans” gets us to a plan that is simple in absolute terms; and c) there seem to be concrete reasons to expect training superhuman models to be meaningfully different from current LLMs, which aren’t at a stage where I’d expect them to exhibit any of the properties I’d be worried about.
(Also, in your shard theory post, you skip over the example of ‘embarrassment’ because you can’t explain it yet, and switch to sugar, and I’m like ‘but, the embarrassment one was much more cruxy and important!’)
I don’t expect to get to agreement in the comments here today, but it feels like the current way you’re arguing this point just isn’t landing or having the effect you want, and… I dunno what would resolve things for you or anyone else, but I think it’d be better if you tried some different approaches for arguing this point.
If you feel like you’ve explained the details of those things better in the second half of one of your posts, I will try giving it a more thorough read. (It’s been a while since I read your Diamond Maximizer post, which I don’t remember in detail but don’t remember finding compelling at the time.)
I think what Turntrout is saying is that people on LW have a tendency to claim that, because future AI will be different from present AI, they can start privileging the hypotheses about AI alignment they predicted years ago, when you can’t actually do this. That is a problem I do see on LW quite a bit.
We’ve talked about this a bit before here, but one area where I do think we can generalize from LLMs to future models is how they represent and handle human values. One of the insights is that human values are both simpler in their generative structure and more data-dependent than a lot of LWers thought years ago. That suggests an immediate alignment strategy: train on dense datasets about human values, using synthetic data either fed directly to the AI or used to create a densely defined reward function, which offers much less opportunity for reward hacking than a sparsely defined one.
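To make the dense-vs-sparse distinction concrete, here is a minimal toy sketch (my own illustration, not anyone’s actual training setup; `sparse_reward`, `dense_reward`, and `value_model` are all hypothetical names): the sparse spec only scores the final outcome, while the dense spec scores every step with a learned model of human values, leaving fewer unscored gaps to exploit.

```python
from typing import Callable, List

State = str  # toy stand-in for whatever the environment state actually is

def sparse_reward(trajectory: List[State], goal_reached: bool) -> List[float]:
    """Sparse spec: a single signal at the end, zero everywhere else.

    The agent gets almost no signal about *why* an outcome was good or bad,
    which leaves lots of room to find degenerate ways to trip the final check.
    """
    return [0.0] * (len(trajectory) - 1) + [1.0 if goal_reached else 0.0]

def dense_reward(trajectory: List[State],
                 value_model: Callable[[State], float]) -> List[float]:
    """Dense spec: every step is scored by a learned model of human values.

    `value_model` is hypothetical here -- the idea is that a model trained on
    a dense (possibly synthetic) dataset of human evaluations constrains the
    policy at every step instead of only at the end.
    """
    return [value_model(state) for state in trajectory]

if __name__ == "__main__":
    toy_trajectory = ["start", "take questionable shortcut", "goal state"]
    # A toy value model that penalizes the shortcut; a real one would be learned.
    toy_value_model = lambda s: -1.0 if "questionable" in s else 0.5
    print(sparse_reward(toy_trajectory, goal_reached=True))   # [0.0, 0.0, 1.0]
    print(dense_reward(toy_trajectory, toy_value_model))      # [0.5, -1.0, 0.5]
```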
The important takeaway isn’t the LLMs’ safety properties, but what they tell us about us and our values, which is why I think these lessons can transfer to different AI models even as AI changes with progress:
https://www.lesswrong.com/posts/7fJRPB6CF6uPKMLWi/my-ai-model-delta-compared-to-christiano#LYyZm8JRJJ4F4wZSu
Cf. this part of a comment by Linch, which gets at my point on the surprising simplicity of human values while summarizing a post by Matthew Barnett. The link is below:
https://www.lesswrong.com/posts/i5kijcjFJD6bn7dwq/evaluating-the-historical-value-misspecification-argument#N9ManBfJ7ahhnqmu7
Suppose in 2000 you were told that a 100-line Python program (that doesn’t abuse any of the particular complexities embedded elsewhere in Python) can provide a perfect specification of human values. Then you should rationally conclude that human values aren’t actually all that complex (more complex than the clean mathematical statement, but simpler than almost everything else). In such a world, if inner alignment is solved, you can “just” train a superintelligent AI to “optimize for the results of that Python program” and you’d get a superintelligent AI with human values. Notably, alignment isn’t solved by itself. You still need to get the superintelligent AI to actually optimize for that Python program and not some random other thing that happens to have low predictive loss in training on that program.

Well, in 2023 we have that Python program, with a few relaxations:
- The answer isn’t embedded in 100 lines of Python, but in a subset of the weights of GPT-4. Notably, the human value function (as expressed by GPT-4) is necessarily significantly simpler than the weights of GPT-4, as GPT-4 knows so much more than just human values.
- What we have now isn’t a perfect specification of human values, but instead roughly the level of understanding of human values that an 85th percentile human can come up with.
- The human value function as expressed by GPT-4 is also immune to almost all in-practice, non-adversarial perturbations.

We should then rationally update on the complexity of human values. It’s probably not much more complex than GPT-4, and possibly significantly simpler. I.e., the fact that we have a pretty good description of human values well short of superintelligent AI means we should not expect a perfect description of human values to be very complex either.
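To gesture at what “that Python program” looks like in practice, here is a rough sketch of using an LLM as an approximate human value function. This is my own illustration, with `ask_llm` as a placeholder for whatever chat-model call you would actually make, rather than any particular API:

```python
from typing import Callable

def approximate_human_value(
    outcome_description: str,
    ask_llm: Callable[[str], str],
) -> float:
    """Use an LLM as a rough, non-adversarially-robust human value function.

    `ask_llm` is a placeholder for a real chat-model call; the point is that
    the 'specification' of human values lives in the model weights, and this
    wrapper is only a thin interface to it.
    """
    prompt = (
        "On a scale from 0 (clearly against broadly shared human values) to 10 "
        "(clearly in line with them), rate the following outcome. "
        "Reply with a single number.\n\n"
        f"Outcome: {outcome_description}"
    )
    reply = ask_llm(prompt)
    try:
        return float(reply.strip()) / 10.0
    except ValueError:
        # Malformed reply: fall back to a neutral score rather than crashing.
        return 0.5

# Toy stand-in so the sketch runs without any real API:
fake_llm = lambda prompt: "2" if "deceives" in prompt else "8"
print(approximate_human_value("The assistant deceives its users for profit.", fake_llm))
print(approximate_human_value("The assistant helps cure a disease.", fake_llm))
```

As the quote itself notes, this kind of value function is only robust to ordinary, non-adversarial inputs; whether it holds up under strong optimization pressure against the score is exactly the Goodhart question raised above.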
(I also disagree with the assumption that scaling up AI is more dangerous than scaling up humans, but that’s something I’ll leave for another day.)