I would have really appreciated documentation on this, fwiw!
David Rein
Hmm yeah that’s fair, but I think what I said stands as a critique of a certain perspective on alignment, insofar as I think having the alignment curve grow faster than the capabilities curve at every step is equivalent to solving the core hard problem. I agree that we need to solve the core hard problem, but we need to delay fast takeoff until we are very confident that the problems are solved.
‘The goal of alignment research should be to get us into “alignment escape velocity”, which is where the rate of alignment progress (which will largely come from AI as we progress) is fast enough to prevent doom for enough time to buy even more time.’
^ the above argument only works if you think that there will be a relatively slow takeoff. If there is a fast takeoff, the only way to buy more time is to delay that takeoff, because alignment won’t scale as quickly as capabilities under a period of significant and rapid recursive self-improvement.
Alignment is a stabilizing force against fast takeoff, because the models will not want to train models that don’t do what *they* want. So, the goals/values of the superintelligence we get after a takeoff might actually end up being the values of models that are just past the point of capability where they are able to align their successors. I’d expect these values to be different from the values of the initial model that started the recursive self-improvement process, because I don’t expect that initial model to be capable of solving (or caring about) alignment enough, and because there may be competitive dynamics that cause ~human-level AI to train successors that are misaligned to it.
phone.spinning’s Shortform
I think AI of the capability level that you describe will either already have little need to exploit people, or will quickly train successors that wouldn’t benefit from doing so. I do think deception is a big issue, but I think the important parts of deception will show up earlier, in terms of AI capability, than you describe.
Which suggests that if you’re doing randomish exploration, you should try to shake things up and move in a bunch of dimensions at once rather than just moving along a single identified dimension.
If you can only do randomish exploration, this sounds right, but I think this often isn’t the right approach (not saying you would disagree with this, just pointing it out). When we change things along basis vectors, we’re implicitly taking advantage of the fact that we have a built-in basis for the world (namely, our general world model). This lets us reason about things like causality, constraints, etc., since we are already parsing the world into a useful basis.
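To illustrate the “shake things up” point, here is a minimal NumPy sketch (the dimensionality and step size are arbitrary placeholders I chose, not anything from the original discussion) contrasting a step along a single identified basis vector with a step of the same length spread across a random direction:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 100      # dimensionality of the search space (arbitrary)
step = 0.1   # perturbation size (arbitrary)

x = rng.normal(size=d)  # current point

# Option 1: move along a single identified basis vector (one coordinate).
axis_step = np.zeros(d)
axis_step[rng.integers(d)] = step

# Option 2: "shake things up" -- move in a random direction that touches
# many dimensions at once, normalized to the same step length.
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
many_dim_step = step * direction

# Candidate next points under each scheme:
x_axis = x + axis_step
x_random = x + many_dim_step

# Both steps have the same norm, but the first changes one coordinate a lot
# while the second changes every coordinate a little.
print(np.linalg.norm(axis_step), np.linalg.norm(many_dim_step))
print(np.count_nonzero(axis_step), np.count_nonzero(many_dim_step))
```

Nothing here depends on what the space actually represents; the point is just that the random-direction step distributes the same amount of movement over every coordinate rather than concentrating it in one.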
Epistemic status: I’m somewhat confident this is a useful axis for describing/considering alignment strategies/perspectives, but I’m pretty uncertain which end of it is better to focus on. I could be missing important considerations, or weighing the considerations listed inaccurately.
When thinking about technical alignment strategy, I’m unsure whether it’s better to focus on trying to align systems that already exist (and are misaligned), or to focus on procedures that train aligned models from the start.
The first case is harder, which means focusing on it is closer to minimax optimization (minimizing the badness of the worst-case outcome), which is often a good objective to optimize. Plus, it could be argued that it’s more realistic, because the current AI paradigm of “learn a bunch of stuff about the world with an SSL prediction loss, then fine-tune it to point at actually useful tasks” is very similar to this, and might require us to align pre-trained, unaligned systems.
However, I don’t think only focusing on this paradigm is necessarily correct. I think it’s important that we have methods that can train aligned models (more or less) from scratch (and still be roughly competitive with other methods). I’m uncertain how hard fine-tuning SSL prediction models is, and I think the frame of “we’re training models to be a certain way” admits different solutions and strategies than “we need to align a potential superintelligence”.
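To make the contrast between the two strategies concrete, here is a toy PyTorch sketch (the losses, model, and data below are placeholder stand-ins I invented, not real alignment objectives) of the two training schedules: pretrain with an SSL-style prediction loss and then fine-tune, versus keeping the “alignment” objective in the loop from the start:

```python
import torch
from torch import nn, optim

# Toy stand-ins: these do not correspond to real alignment objectives; they
# only illustrate the two training schedules being contrasted.
def ssl_prediction_loss(model, batch):
    # placeholder self-supervised objective: predict the input itself
    return ((model(batch) - batch) ** 2).mean()

def alignment_loss(model, batch, target):
    # placeholder objective standing in for "point the model at desired behavior"
    return ((model(batch) - target) ** 2).mean()

def make_model():
    return nn.Linear(16, 16)

data = torch.randn(256, 16)
target = torch.zeros(256, 16)

# Strategy 1: pretrain with the SSL loss, then fine-tune an already-trained
# (and possibly misaligned) model toward the alignment objective.
model_a = make_model()
opt_a = optim.SGD(model_a.parameters(), lr=1e-2)
for _ in range(100):  # pretraining phase
    opt_a.zero_grad()
    ssl_prediction_loss(model_a, data).backward()
    opt_a.step()
for _ in range(20):   # alignment fine-tuning phase
    opt_a.zero_grad()
    alignment_loss(model_a, data, target).backward()
    opt_a.step()

# Strategy 2: train with the alignment objective in the loop from the start.
model_b = make_model()
opt_b = optim.SGD(model_b.parameters(), lr=1e-2)
for _ in range(120):
    opt_b.zero_grad()
    loss = ssl_prediction_loss(model_b, data) + alignment_loss(model_b, data, target)
    loss.backward()
    opt_b.step()
```

Nothing here is a real method; the structural difference is just that the second schedule never has a phase where the model trains without the alignment term in the loop.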
One way to cash this out more concretely is the extent to which you view amplification or debate as fine-tuning for alignment, versus as a method of learning new capabilities (while maintaining alignment properties).