Thanks for this very clear explanation of your thinking. A couple of follow-ups, if you don’t mind.
Unfortunately, I think that sort of analysis generally suggests that most of these sorts of training setups would end up giving you a deceptive model, or at least not the intended model.
Suppose the intended model is one that predicts H’s estimate at convergence, and the actual model predicts H’s estimate at round N for some fixed N larger than any convergence time in the training set. Would you call this an “inner alignment failure”, an “outer alignment failure”, or something else (not an alignment failure)?
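To make the distinction concrete, here is a minimal sketch (purely illustrative; `h_estimate` and `has_converged` are toy stand-ins I made up, not part of the actual training setup):

```python
# Toy stand-in for H's estimate on question q after n rounds of deliberation;
# it settles down as n grows, so "convergence" is well-defined here.
def h_estimate(q, n):
    return 1.0 - 1.0 / (n + 1)

def has_converged(q, n, tol=1e-6):
    """Toy convergence check on successive estimates."""
    return abs(h_estimate(q, n + 1) - h_estimate(q, n)) < tol

def intended_target(q):
    """Intended model: H's estimate at convergence."""
    n = 0
    while not has_converged(q, n):
        n += 1
    return h_estimate(q, n)

def learned_model(q, N=10_000):
    """Conjectured actual model: H's estimate at a fixed round N,
    where N happens to exceed every convergence time seen in training."""
    return h_estimate(q, N)
```

On the training distribution the two agree (since N exceeds every convergence time seen there); the difference only shows up on harder questions whose convergence time exceeds N.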
Putting these theoretical/conceptual questions aside, the reason I started thinking about this is that I was considering the following scenario. Suppose some humans are faced with a time-sensitive and highly consequential decision, for example whether to join or support some proposed AI-based governance system (analogous to the 1690 “liberal democracy” question), or how to respond to a hostile superintelligence that is trying to extort all or most of their resources. It seems that convergence on such questions might take orders of magnitude more time than what M was trained on. What do you think would actually happen if the humans asked their AI advisor to help with a decision like this? (What are some outcomes you think are plausible?)
What’s your general thinking about this kind of AI risk (i.e., where an astronomical amount of potential value is lost because human-AI systems fail to make the right decisions in high-stakes situations brought about by the advent of transformative AI)? Is this something you worry about as an alignment researcher, or do you (for example) think it’s orthogonal to alignment and should be studied in another branch of AI safety / AI risk?
> Suppose the intended model is one that predicts H’s estimate at convergence, and the actual model predicts H’s estimate at round N for some fixed N larger than any convergence time in the training set. Would you call this an “inner alignment failure”, an “outer alignment failure”, or something else (not an alignment failure)?
I would call that an inner alignment failure, since the model isn’t optimizing for the actual loss function, but I agree that the distinction is murky. (I’m currently working on a new framework that I really wish I could reference here, but it isn’t quite ready to be public yet.)
> It seems that convergence on such questions might take orders of magnitude more time than what M was trained on. What do you think would actually happen if the humans asked their AI advisor to help with a decision like this? (What are some outcomes you think are plausible?)
That’s a hard question to answer, and it really depends on how optimistic you are about generalization. If you just used current methods but scaled up, my guess is you would get deception and it would try to trick you. If we condition on it not being deceptive, I’d guess it would be pursuing some weird proxies rather than actually trying to report the human equilibrium after any number of steps. If we condition on it actually trying to report the human equilibrium after some number of steps, though, my guess is that the simplest way to do that isn’t to have some finite cutoff, so I’d guess it would do something like take an expectation over an exponentially distributed number of steps.
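To illustrate that last guess, here’s a minimal sketch (my own illustration, using a geometric distribution as the discrete analogue of exponentially distributed steps; `h_estimate` is the same toy stand-in as in the earlier sketch, not anything from the actual setup):

```python
import random

def h_estimate(q, n):
    return 1.0 - 1.0 / (n + 1)  # toy estimate that settles down as n grows

def expected_estimate(q, continue_prob=0.99, num_samples=1_000):
    """Monte Carlo estimate of E[h_estimate(q, n)] with n geometrically
    distributed: after each round, deliberation continues with probability
    continue_prob, so there is no hard cutoff at any fixed round N."""
    total = 0.0
    for _ in range(num_samples):
        n = 0
        while random.random() < continue_prob:
            n += 1
        total += h_estimate(q, n)
    return total / num_samples
```

Unlike a fixed-N cutoff, this kind of predictor puts some weight on arbitrarily long deliberations, which is why it seems like a simpler generalization than memorizing a particular horizon.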
> What’s your general thinking about this kind of AI risk (i.e., where an astronomical amount of potential value is lost because human-AI systems fail to make the right decisions in high-stakes situations brought about by the advent of transformative AI)? Is this something you worry about as an alignment researcher, or do you (for example) think it’s orthogonal to alignment and should be studied in another branch of AI safety / AI risk?
Definitely seems worth thinking about and taking seriously. Some thoughts:
Ideally, I’d like to just avoid making any decisions that lead to lock-in while we’re still figuring things out (e.g. wait to build anything like a sovereign for a long time). Of course, that might not be possible/realistic/etc.
Hopefully, this problem will just be solved as AI systems become more capable—e.g. if you have a way of turning any unaligned benchmark system into a new system that honestly/helpfully reports everything that the unaligned benchmark knows, then as the unaligned benchmark gets better, you should get better at making decisions with the honest/helpful system.