Assuming that there is an “alignment homework” to be done, I am tempted to answer something like: AI can do our homework for us, but only if we are already in a position where we could solve that homework even without AI.
An important disclaimer is that perhaps there is no “alignment homework” that needs to get done (“alignment by default”, “AGI being impossible”, etc.). So some people might be optimistic about Superalignment, but for reasons that seem orthogonal to this question—namely, because they think that the homework to be done isn’t particularly difficult in the first place.
For example, suppose OpenAI can use AI to automate many research tasks that they already know how to do, or to scale up the amount of research they produce. But this is likely to only give them the kinds of results that they could come up with themselves (except possibly much faster, which I acknowledge matters).

However, suppose that the solution to making AI go well lies outside of the ML paradigm. Then OpenAI’s “superalignment” approach would need to naturally generate solutions outside of the ML paradigm. Or it would need to cause the org to pivot to a new paradigm. Or it would need to convince OpenAI that way more research is needed, and that they need to stop AI progress until that happens.

And my point here is not to argue that this won’t happen. Rather, I am suggesting that whether this would happen seems strongly connected to whether OpenAI would be able to do these things even prior to all the automation. (IE, this depends on things like: Will people think to look into a particular problem? Will people be able to evaluate the quality of alignment proposals? Is the organisational structure set up such that warning signs will be taken seriously?)
To put it another way:
We can use AI to automate an existing process, or a process that we can describe in enough detail. (EG, suppose we want to “automate science”. Then an example of a thing that we might be able to do would be to: Set up a system where many LLMs are tasked to write papers. Other LLMs then score those papers using the same system as human researchers use for conference reviews. And perhaps the most successful papers then get added to the training corpus of future LLMs. And then we repeat the whole thing. However, we do not know how to “magically make science better”.)
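To make the shape of that loop concrete, here is a minimal sketch in Python. The helpers write_paper and review_paper are hypothetical stubs invented for illustration; they stand in for LLM calls and are not real APIs or anything OpenAI has described.

```python
import random

# Hypothetical stubs standing in for LLM calls; in practice these would be
# model invocations, not a fixed string and a random score.
def write_paper(corpus):
    return f"a paper generated from a corpus of {len(corpus)} documents"

def review_paper(paper):
    return random.uniform(0, 10)  # stand-in for an LLM-produced review score

def automated_science_round(corpus, n_papers=100, n_reviews=3, top_k=10):
    """One round of the loop: LLMs write papers, other LLMs review them,
    and the top-scoring papers are folded back into the corpus that
    future models are trained on."""
    papers = [write_paper(corpus) for _ in range(n_papers)]
    scored = [
        (sum(review_paper(p) for _ in range(n_reviews)) / n_reviews, p)
        for p in papers
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return corpus + [paper for _, paper in scored[:top_k]]
```

Every step in this loop is something we already know how to specify; nothing in it tells us how to “magically make science better”.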
We can also have AI generate solution proposals, but this will only be helpful to the extent that we know how to evaluate the quality of those proposals.[1] (EG, we can use AI to factorise numbers into their prime factors, since we know how to check whether p₁⋅p₂ equals the original number. However, suppose we use an AI to generate a plan for improving the urban design of a particular city. Then it’s not really clear how to evaluate that plan. And the same issue arises when we ask for plans regarding the problem of “making AI go well”.)
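As a toy restatement of that asymmetry in code: verifying a proposed factorisation is a few mechanical lines, whereas we have no analogous checker for urban-design plans, or for plans to “make AI go well”. (The function names below are made up for illustration.)

```python
def is_valid_factorisation(n, factors):
    """Cheap, mechanical check that the proposed factors multiply back to n.
    (Checking that each factor is prime is also cheap.)"""
    product = 1
    for p in factors:
        product *= p
    return product == n

# Easy to verify, even though finding the factors may have been hard:
assert is_valid_factorisation(91, [7, 13])

# By contrast, we do not know how to write the analogous
# is_good_urban_design_plan(plan) -- or is_good_alignment_plan(plan).
```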
Finally, suppose you think that the problem with “making AI go well” is the relative speeds of progress in AI capabilities vs AI alignment. Then you need to additionally explain why the AI will do our alignment homework for us while simultaneously refraining from helping with the capabilities homework.[2]
[1] A relevant intuition pump: The usefulness of forecasting questions on prediction markets seems limited by your ability to specify the resolution criteria.
[2] The reasonable default assumption might be that AI will speed up capabilities and alignment equally. In contrast, arguing for a disproportionate speedup of alignment sounds like corporate b...cheap talk. However, there might be reasons to believe that AI will disproportionately speed up capabilities—for example, because we know how to evaluate capabilities research, while the field of “making AI go well” is much less mature.
I think there is an important distinction between “If given substantial investment, would the plan to use the AIs to do alignment research work?” and “Will it work in practice given realistic investment?”.
The cost of the approach where the AIs do alignment research might look like 2 years of delay in median worlds and perhaps considerably more delay with some probability.
This is a substantial cost, but it’s not an insanely high cost.
I feel a bit confused about your comment: I agree with each individual claim, but I feel like perhaps you meant to imply something beyond just the individual claims. (Which I either don’t understand or perhaps disagree with.)
Are you saying something like: “Yeah, I think that while this plan would work in theory, I expect it to be hopeless in practice (or unnecessary because the homework wasn’t hard in the first place).”?
If yes, then I agree—but I feel that of the two questions, “would the plan work in theory” is the much less interesting one. (For example, suppose that OpenAI could in theory use AI to solve alignment in 2 years. Then this won’t really matter unless they can refrain from using that same AI to build misaligned superintelligence in 1.5 years. Or suppose the world could solve AI alignment if the US government instituted a 2-year moratorium on AI research—then this won’t really matter unless the US government actually does that.)
I just think that these are important concepts to distinguish, because it’s useful to notice the extent to which problems could be solved by a moderate amount of coordination, and which asks could suffice for safety.
I wasn’t particularly trying to make a broader claim, just trying to highlight something that seemed important.
My overall guess is that it’s about 50% likely that people will pay costs equivalent to 2 years of delay for existential safety reasons. (Though I’m uncertain overall, and this is possible to influence.) Thus, ensuring that the plan for spending that budget is as good as possible looks quite good. And not hopeless overall.
By analogy, note that Google bears substantial costs to improve security (e.g. running 10% slower).
I think that if we could ensure the implementation of our best safety plans which just cost a few years of delay, we’d be in a much better position.