Noosphere89 comments on TurnTrout’s shortform feed

Noosphere89 27 Sep 2024 1:37 UTC
3 points
2
I think what TurnTrout is saying is that people on LW tend to treat the fact that future AI will differ from present AI as license to keep privileging the hypotheses about AI alignment that they predicted years ago. You can’t actually do this, and it’s a problem I do see on LW quite a bit.
We’ve talked about this a bit before here, but one area where I do think we can generalize from LLMs to future models is how they represent and handle human values. One of the insights is that human values are both simpler in their generative structure and more data-dependent than a whole lot of LWers thought years ago. That suggests an immediate alignment strategy: train on dense data sets about human values using synthetic data, either fed directly to the AI or used to create a densely defined reward function that offers much less opportunity for reward hacking than sparsely defined reward functions do.
The important takeaway isn’t about LLM safety properties, but about us and our values, which is why I think these insights can transfer to different AI models even as AI changes as it progresses:
https://www.lesswrong.com/posts/7fJRPB6CF6uPKMLWi/my-ai-model-delta-compared-to-christiano#LYyZm8JRJJ4F4wZSu
Cf. this part of a comment by Linch, which gets at my point about the surprising simplicity of human values while summarizing a post by Matthew Barnett:
Suppose in 2000 you were told that a 100-line Python program (that doesn’t abuse any of the particular complexities embedded elsewhere in Python) can provide a perfect specification of human values. Then you should rationally conclude that human values aren’t actually all that complex (more complex than the clean mathematical statement, but simpler than almost everything else).
In such a world, if inner alignment is solved, you can “just” train a superintelligent AI to “optimize for the results of that Python program” and you’d get a superintelligent AI with human values.
Notably, alignment isn’t solved by itself. You still need to get the superintelligent AI to actually optimize for that Python program and not some random other thing that happens to have low predictive loss in training on that program.
Well, in 2023 we have that Python program, with a few relaxations:
The answer isn’t embedded in 100 lines of Python, but in a subset of the weights of GPT-4. Notably, the human value function (as expressed by GPT-4) is necessarily significantly simpler than the weights of GPT-4, as GPT-4 knows so much more than just human values.
What we have now isn’t a perfect specification of human values, but instead roughly the level of understanding of human values that an 85th percentile human can come up with.
The human value function as expressed by GPT-4 is also immune to almost all in-practice, non-adversarial perturbations.
We should then rationally update on the complexity of human values. It’s probably not much more complex than GPT-4, and possibly significantly simpler. I.e., the fact that we have a pretty good description of human values well short of superintelligent AI means we should not expect a perfect description of human values to be very complex either.
The link is below:
https://www.lesswrong.com/posts/i5kijcjFJD6bn7dwq/evaluating-the-historical-value-misspecification-argument#N9ManBfJ7ahhnqmu7
(I also disagree with the assumption that scaling up AI is more dangerous than scaling up humans, but that’s something I’ll leave for another day.)