I was talking specifically about algorithms that build a model of a human and then optimize over that model in order to do useful algorithmic work (e.g. modeling human translation quality and then choosing the optimal translation).
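For concreteness, here is a minimal sketch of that pattern in Python, using translation as the example; the names (e.g. predicted_human_quality) are illustrative stand-ins for a learned model of human judgments, not any particular system.

```python
# Sketch of "build a model of a human, then optimize over that model":
# a learned predictor of human translation-quality judgments is used to
# pick the best candidate. `predicted_human_quality` is an assumed,
# illustrative callable, not a real API.

from typing import Callable, List

def choose_translation(
    source: str,
    candidates: List[str],
    predicted_human_quality: Callable[[str, str], float],
) -> str:
    """Return the candidate that the learned quality model scores highest.

    The optimization pressure is applied to the model of the human,
    not to the human directly."""
    return max(candidates, key=lambda t: predicted_human_quality(source, t))
```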
i.e., how much external support / error correction do humans need, or more generally how benign of an environment do we have to place them in, before we can expect them to eventually figure out their “actual” values
I still don’t get your position on this point, but we seem to be going around a bit in circles. Probably the most useful thing would be responding to Jessica’s hypothetical about putting humanity in a box.
searching for fundamental obstructions to aligned AI
I am not just looking for an aligned AI; I am looking for a transformation from (arbitrary AI) --> (aligned AI). I think the thing to understand is what kinds of algorithms can plausibly give you AI systems without being repurposable into aligned ones (with O(1) work). I think this is the kind of problem for which you are either going to get a positive or a negative answer. That is, probably there is either such a transformation, or a hard-to-align design and an argument for why we can’t align it.
(This example with optimizing over human translations seems like it could well be an insurmountable obstruction, implying that my most ambitious goal is impossible.)
I don’t understand Paul and MIRI’s reasons for being as optimistic as they each are on this issue.
I believe that both of us think that what you perceive as a problem can be sidestepped (I think this is the same issue we are going in circles around).
It seems unlikely to me that alignment to complex human values comes for free.
The hope is to do a sublinear amount of additional work, not to get it for free.
It’s more of an argument for simultaneously trying to find and pursue other approaches to solving the “AI risk” problem, especially ones that don’t require the same preconditions in order to succeed
It seems like we are roughly on the same page, but I am more optimistic about discovering either a positive or a negative answer, and so I think this approach is the highest-leveraged thing to work on and you don’t.
I think that cognitive or institutional enhancement is also a contender, as is getting our house in order, even if our only goal is dealing with AI risk.
I still don’t get your position on this point, but we seem to be going around a bit in circles.
Yes, my comment was aimed more at other people, who I’m hoping can provide their own views on these issues. (It’s kind of strange that more people haven’t commented on your ideas online. I’ve asked to be invited to any future MIRI workshops discussing them, in case most of the discussions are happening offline.)
I am not just looking for an aligned AI; I am looking for a transformation from (arbitrary AI) --> (aligned AI). I think the thing to understand is what kinds of algorithms can plausibly give you AI systems without being repurposable into aligned ones (with O(1) work).
Can you be more explicit and formal about what you’re looking for? Is it a transformation T, such that for any AI A, T(A) is an aligned AI as efficient as A, and applying T amounts to O(1) of work? (O(1) relative to what variable? The work that originally went into building A?)
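To make that reading explicit (just one candidate formalization, with aligned, cost, and work left as informal placeholders, and “as efficient as” read as “no more costly than”):

```latex
\exists\, T \;\; \forall\, A:\quad
\mathrm{aligned}\big(T(A)\big)
\;\wedge\; \mathrm{cost}\big(T(A)\big) \le \mathrm{cost}(A)
\;\wedge\; \mathrm{work}(T, A) = O(1)
```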
If that’s what you mean, then it seems obvious that T doesn’t exist, but I don’t know how else to interpret your statement.
That is, probably there is either such a transformation, or a hard-to-align design and an argument for why we can’t align it.
I don’t understand why this disjunction is true, which might be because of my confusion above. Also, even if you found a hard-to-align design and an argument for why we can’t align it, that doesn’t show that aligned AIs can’t be competitive with unaligned AIs (in order to convince others to coordinate, as Jessica wrote). The people who need convincing will just think there are almost certainly other ways to build a competitive aligned AI that don’t involve transforming the hard-to-align design.
Can you be more explicit and formal about what you’re looking for?
Consider some particular research program that might yield powerful AI systems, e.g. (search for better model classes for deep learning, search for improved optimization algorithms, deploy these algorithms on increasingly large hardware). For each such research program I would like to have some general recipe that takes as input the intermediate products of that program (i.e. the hardware and infrastructure, the model class, the optimization algorithms) and uses them to produce a benign AI which is competitive with the output of the research program. The additional effort and cost required for this recipe would ideally be sublinear in the effort invested in the underlying research project (O(1) was too strong).
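As an interface sketch only (every name below is a placeholder for whatever a given research program actually produces, not a real library), the recipe would look roughly like this:

```python
# Interface sketch of the hoped-for "recipe": it consumes the intermediate
# products of a capabilities research program and returns a benign agent
# competitive with that program's intended output. All names here are
# illustrative assumptions.

from dataclasses import dataclass
from typing import Any

@dataclass
class IntermediateProducts:
    hardware_and_infrastructure: Any  # e.g. the training cluster
    model_class: Any                  # e.g. a family of deep learning models
    optimization_algorithm: Any       # e.g. the training procedure

@dataclass
class BenignAgent:
    policy: Any        # intended to be competitive with the program's output
    extra_cost: float  # the hope: grows sublinearly in the program's effort

def alignment_recipe(products: IntermediateProducts) -> BenignAgent:
    """One recipe per research program: reuse the program's own intermediate
    products to build a benign agent competitive with its output."""
    raise NotImplementedError("finding such a recipe is the research goal")
```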
I suspect this is possible for some research programs and not others. I expect there are some programs for which this goal is demonstrably hopeless. I think that those research programs need to be treated with care. Moreover, if I had a demonstration that a research program was dangerous in this way, I expect that I could convince people that it needs to be treated with care.
The people who need convincing will just think there are almost certainly other ways to build a competitive aligned AI that don’t involve transforming the hard-to-align design.
Yes, at best someone might agree that a particular research program is dangerous/problematic. That seems like enough though—hopefully they could either be convinced to pursue other research programs that aren’t problematic, or would continue with the problematic research program and could then agree that other measures are needed to avert the risk.
If an AI causes its human controller to converge to false philosophical conclusions (especially ones relevant to their values), either directly through its own actions or indirectly by allowing adversarial messages to reach the human from other AIs, it clearly can’t be considered benign. But given our current lack of metaphilosophical understanding, how do you hope to show that any particular AI (e.g., the output of a proposed transformation/recipe) won’t cause that? Or is the plan to accept a lower burden of proof, namely assume that the AI is benign as long as no one can show that it does cause its human controller to converge to false philosophical conclusions?
The additional effort and cost required for this recipe would ideally be sublinear in the effort invested in the underlying research project (O(1) was too strong).
If such a recipe existed for a project, that doesn’t mean it’s not problematic. For example, suppose there are projects A and B, each of which had such a recipe. It’s still possible that using intermediate products of both A and B, one could build a more efficient unaligned AI but not a competitive aligned AI (at sublinear additional cost). Similarly, even if no such recipe existed for project A or for project B individually, one might still exist for A+B. It seems like to make a meaningful statement you have to treat the entire world as one big research program. Do you agree?
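To spell the worry out in symbols (purely illustrative notation: c(P) is the cost of the best unaligned system buildable from project P’s intermediate products, c_benign(P) the cost of the best competitive benign system a recipe yields from those same products, and effort(P) the effort invested in P):

```latex
c_{\mathrm{benign}}(A) \le c(A) + o(\mathrm{effort}(A))
\;\wedge\;
c_{\mathrm{benign}}(B) \le c(B) + o(\mathrm{effort}(B))
\quad\not\Longrightarrow\quad
c_{\mathrm{benign}}(A{+}B) \le c(A{+}B) + o(\mathrm{effort}(A) + \mathrm{effort}(B))
```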
If such a recipe existed for a project, that doesn’t mean it’s not problematic. For example, suppose there are projects A and B, each of which had such a recipe. It’s still possible that using intermediate products of both A and B, one could build a more efficient unaligned AI but not a competitive aligned AI
My hope is to get technique A working, then get technique B working, and then get A+B working, and so on, prioritizing based on empirical guesses about what combinations will end up being deployed in practice (and hoping to develop general understanding that can be applied across many programs and combinations). I expect that in many cases, if you can handle A and B you can handle A+B, though some interactions will certainly introduce new problems. This program doesn’t have much chance of success without new abstractions.
indirectly by allowing adversarial messages to reach the human from other AIs, it clearly can’t be considered benign
This is definitely benign on my accounting. There is a further question of how well you do in conflict. A window is benign but won’t protect you from inputs that will drive you crazy. The hope is that if you have an AI that is benign + powerful then you may be OK.
directly through its own actions
If the agent is trying to implement deliberation in accordance with the user’s preferences about deliberation, then I want to call that benign. There is a further question of whether we mess up deliberation, which could happen with or without AI. We would like to set things up in such a way that we aren’t forced to deliberate earlier than we would otherwise want to. (And this is included in the user’s preferences about deliberation, i.e. a benign AI will be trying to secure for the user the option of deliberating later, if the user believes that deliberating later is better than deliberating in concert with the AI now.)
Malign just means “actively optimizing for something bad.” The hope is to avoid that, but this doesn’t rule out other kinds of problems (e.g. causing deliberation to go badly due to insufficient competence, blowing up the world due to insufficient competence, etc.)
Overall, my current best guess is that this disagreement is better to pursue after my research program is further along: once we know things like whether “benign” makes sense as an abstraction, once I have considered some cases where benign agents necessarily seem to be less efficient, and so on.
I am still interested in arguments that might (a) convince me to not work on this program, e.g. because I should be working on alternative social solutions, or (b) convince others to work on this program, e.g. because they currently don’t see how it could succeed but might work on it if they did, or (c) which clarify the key obstructions for this research program.