One is being good enough to eventually solve all important philosophical problems.
By “good enough to make any progress at all” I meant “towards becoming smarter while preserving their values”; I don’t really care about their resolution of other object-level philosophical questions. E.g. if they can take steps towards safe cognitive enhancement, or if they can learn something about how to deliberate effectively...
It seems to me like for the people to get stuck you have to actually imagine there is some particular level they reach where they can’t find any further way to self-improve. At that point we could ask about the total amount of risk in the deliberative process itself, etc., but my basic point is that the risk is about the same in the “people in a box” scenario as in any other scenario where they can deliberate.
Suppose that after 100 years the people in the box haven’t figured that out yet; what fraction of all humans would vote to go back in the box for another 100 years?
I think many people would be happy to gradually expand and improve quality of life in the box. You could imagine that over the long run this box is like a small city, then a small country, etc., developing along whatever trajectory the people can envision that is optimally conducive to sorting things out in a way they would endorse.
Compared to the current situation, they may have some unrealized ability to significantly improve their quality of life, but it seems at best modest—you can do most of the obvious life improvement without compromising the integrity of the reflective process. I don’t really see how other aspects of their situation are problematic.
Re security:
There is some intermediate period before you can actually run an emulation of the human, after which the measures you discuss apply just as well to the humans (which still expand the attack surface, but apparently by an extremely tiny amount, since it’s not much information, it doesn’t have to interact with the world, etc.). So we are discussing the total excess risk during that period. I can agree that over an infinitely long future the kinds of measures you mention are relevant, but I don’t yet see the case for this being a significant source of losses over the intermediate period.
(Obviously I expect our actual mechanisms to work much better than this, but given that I don’t understand why you would have significant concerns about this situation, it seems like we have some more fundamental disagreements.)
it may be hard to define “security” so that an attacker couldn’t think of some loophole in the definition
I don’t think we need to give a definition. I’m arguing that we can replace “can solve philosophical problems” with “understands what it means to give the box control of resources.” (Security is one aspect of giving the box control of resources, though presumably not the hardest.)
Is your claim that this concept, of letting the box control resources, is itself so challenging that your arguments about “philosophy is hard for humans” apply nearly as well to “defining meaningful control is hard for humans”? Or are you referring to some other obstruction that would require us to give a precise definition?
B almost certainly shouldn’t go for this [the coin flip]
It seems to me like the default is a coin flip. As long as there are unpredictable investments, a risk-neutral actor is free to keep making risky bets until they’ve either lost everything or have enough resources to win a war outright. Yes, you could prevent that by law, but if we can enforce such laws we could also subvert the formation of large coalitions. Similarly, if you have secure rights to deep space then B can guarantee itself a reasonable share of future resources, but in that case we don’t even care who wins the coalitional race. So I don’t yet see a natural scenario where A and B are forced to bargain but the “default” is for B to be able to secure a proportional fraction of the future.
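(To illustrate the dynamic I have in mind, here is a toy gambler’s-ruin simulation; the stake size, thresholds, and trial count are made-up parameters, and nothing hinges on the exact numbers.)

```python
import random

def chance_of_winning_outright(share, stake=0.02, trials=2000):
    """Toy sketch: a risk-neutral actor keeps making fair, fixed-size bets
    until it has either lost everything or accumulated enough resources
    (share >= 1) to win a war outright."""
    wins = 0
    for _ in range(trials):
        s = share
        while 0.0 < s < 1.0:
            s += stake if random.random() < 0.5 else -stake
        wins += s >= 1.0
    return wins / trials

# With fair bets, the chance of ending up on top is roughly equal to the
# starting share of resources, i.e. the "default" outcome is a weighted
# coin flip rather than a guaranteed proportional share.
print(chance_of_winning_outright(1 / 3))  # roughly 0.33
print(chance_of_winning_outright(2 / 3))  # roughly 0.67
```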
Yes, you could propose a bargaining solution that could allow B to secure a proportional fraction of the future, but by the same token A could simply refuse to go for it.
I guess proposing to hard code that into your AI as the only acceptable way to merge
It seems that you are concerned that our AI’s decisions may be bad due to a lack of certain kinds of philosophical understanding, and in particular that it may lose a bunch of value by failing to negotiate coalitions. I am pointing out that even given our current level of philosophical understanding, there is a wide range of plausible bargaining strategies, and I don’t see much of an argument yet that we would end up in a situation where we are at a significant disadvantage due to our lack of philosophical understanding. To get some leverage on that claim, I’m inclined to discuss a bunch of currently-plausible bargaining approaches and then to talk about why they may fall far short.
In the kinds of scenarios I am imagining, you would never do anything even a little bit like explicitly defining a class of bargaining solutions and then accepting precisely those. Even in the “put humans in a box, acquire resources, give them meaningful control over those resources” plan, we aren’t going to give a formal definition of “box,” “resources,” or “meaningful control.” The whole point is just to lower the required ability to do philosophy to the ability required to implement that plan well enough to capture most of the value.
In order to argue against that, it seems like you want to say that in fact implementing that plan is very philosophically challenging. To that end, it’s great to say something like “existing bargaining strategies aren’t great, much better ones are probably possible, finding them probably requires great philosophical sophistication.” But I don’t think one can complain about hand-coding a mechanism for bargaining.
I think it’s unlikely we can predict all the important/time-sensitive philosophical problems that AIs will face, and solve them all ahead of time
I understand your position on this. I agree that we can’t reliably predict all important/time-sensitive philosophical problems. I don’t yet see why this is a problem for my view. I feel like we are kind of going around in circles on this point; to me it feels like this is because I haven’t communicated my view, but it could also be that I am missing some aspect of your view.
To me, the existence of important/time-sensitive philosophical problems seems similar to the existence of destructive technologies. (I think destructive technologies are a much larger problem, and I don’t find the argument for the likelihood of important/time-sensitive philosophical problems compelling. But my main point is the same in both cases, and it’s not clear that their relative importance matters.)
I discuss these issues in this post. I’m curious what you see as the disanalogy between these cases, or whether you think that this argument is not valid in the case of destructive technologies either, or that this is the wrong framing for the current discussion / you are interested in answering a different question than I am / something along those lines.
I see how expecting destructive technologies / philosophical hurdles can increase the value you place on what I called “getting our house in order,” as well as on developing remedies for particular destructive technologies / solving particular philosophical problems / solving metaphilosophy. I don’t see how it can revise our view of the value of AI control by more than, say, a factor of 2.
I don’t see working on metaphilosophy/philosophy as anywhere near as promising as AI control, and, viewed from this perspective, I don’t think you are really trying to argue for that claim (it seems like that would have to be a quantitative argument about the expected damages from lack of timely solutions to philosophical problems and about the tractability of some approach to metaphilosophy or some particular line of philosophical inquiry).
I can imagine that AI control is less promising than other work on getting our house in order. My current suspicion is that AI control is more effective, but realistically it doesn’t matter much to me because of comparative advantage considerations. If not for comparative advantage considerations I would be thinking more about the relative promise of getting our house in order, as well as other forms of capacity-building.
It seems to me like for the people to get stuck you have to actually imagine there is some particular level they reach where they can’t find any further way to self-improve.
For philosophy, levels of ability are not comparable, because the problems to be solved are not sufficiently formulated. Approximate one-day humans (as in HCH) will formulate different values from accurate ten-year humans, not just be worse at elucidating them. So perhaps you could re-implement cognition starting from approximate one-day humans, but the values of the resulting process won’t be like mine.
Approximate short-lived humans may be useful for building a task AI that lets accurate long-lived humans (ems) work on the problem, but it must also allow them to make decisions; it can’t be trusted to prevent what it considers to be a mistake, and so it can’t guard the world from the AI risk posed by the long-lived humans, because they are necessary for formulating values. The risk is that “getting our house in order” outlaws philosophical progress, preventing changes based on considerations that the risk-prevention sovereign doesn’t accept. So the scope of the “house” that is being kept in order must be limited; there should be people working on alignment who are not constrained.
I agree that working on philosophy seems hopeless/inefficient at this point, but that doesn’t resolve the issue, it just makes it necessary to reduce the problem to setting up a very long term alignment research project (initially) performed by accurate long-lived humans, guarded from current risks, so that this project can do the work on philosophy. If this step is not in the design, very important things could be lost, things we currently don’t even suspect. Messy Task AI could be part of setting up the environment for making it happen (like enforcing absence of AI or nanotech outside the research project). Your writing gives me hope that this is indeed possible. Perhaps this is sufficient to survive long enough to be able to run a principled sovereign capable of enacting values eventually computed by an alignment research project (encoded in its goal), even where the research project comes up with philosophical considerations that the designers of the sovereign didn’t see (as in this comment). Perhaps this task AI can make the first long-lived accurate uploads, using approximate short-lived human predictions as initial building blocks. Even avoiding the interim sovereign altogether is potentially an option, if the task AI is good enough at protecting the alignment research project from the world, although that comes with astronomical opportunity costs.
E.g. if they can take steps towards safe cognitive enhancement
I didn’t think that the scenario assumed the bunch of humans in a box had access to enough of an industrial/technological base to do cognitive enhancement. It seems like we’re in danger of getting bogged down in details about the “people in a box” scenario, which I don’t think was meant to be a realistic scenario. Maybe we should just go back to talking about your actual AI control proposals?
So I don’t yet see a natural scenario where A and B are forced to bargain but the “default” is for B to be able to secure a proportional fraction of the future.
Here’s one: suppose A, B, and C each control 1⁄3 of the universe. If A and B join forces they can destroy C and take C’s resources; otherwise it’s a stalemate. (To make the problem easier, assume C can’t join with anyone else.) Another one: A and B each have secure rights, but they need to join together to maximize negentropy.
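(As a toy illustration of what’s at stake for B in the first case, assuming utilities linear in resources and a merge lottery weighted by current holdings; the code below is just for concreteness, not part of the original example.)

```python
# Toy comparison for the A/B/C example above: everyone starts with 1/3,
# and an A-B coalition can capture C's 1/3.
start = 1 / 3
captured = 1 / 3  # C's resources, available only to an A-B coalition

# Option 1: proportional bargain: A and B split the gains in proportion
# to their (equal) holdings, so B ends up with 1/2 for sure.
b_proportional = start + captured * start / (start + start)

# Option 2: coin-flip merge weighted by holdings: the winner's values end up
# controlling everything (the coalition's 2/3 plus C's 1/3), the loser gets 0.
p_b_wins = start / (start + start)
b_coinflip_expected = p_b_wins * 1.0

print(b_proportional)        # 0.5, guaranteed
print(b_coinflip_expected)   # 0.5 in expectation, but all-or-nothing
```

The expectations match, but the risk profiles are very different, which is the sense in which B almost certainly shouldn’t go for the coin flip.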
I’m curious what you see as the disanalogy between these cases
I’m not sure what analogy you’re proposing between the two cases. Can you explain more?
I don’t see how it can revise our view of the value of AI control by more than say a factor of 2.
I didn’t understand this claim when I first read it on your blog. Can you be more formal/explicit about what two numbers you’re comparing, that yields less than a factor of 2?
Maybe we should just go back to talking about your actual AI control proposals
I’m happy to drop it; we seem to be going around in circles on this point as well. I thought this example might be easier to agree about, but I no longer think that.
I’m not sure what analogy you’re proposing between the two cases. Can you explain more?
Certain destructive technologies will lead to bad outcomes unless we have strong coordination mechanisms (to prevent anyone from using such technologies). Certain philosophical errors might lead to bad outcomes unless we have strong coordination mechanisms (to prevent anyone from implementing philosophically unsophisticated solutions). The mechanisms that could cope with destructive technologies could also cope with philosophical problems.
Can you be more formal/explicit about what two numbers you’re comparing, that yields less than a factor of 2?
You argue: there are likely to exist philosophical problems which must be solved before reaching a certain level of technological sophistication, or else there will be serious negative consequences.
I reply: your argument has at most a modest effect on the value of AI control work of the kind I advocate.
Your claim does suggest that my AI control work is less valuable. If there are hard philosophical problems (or destructive physical technologies), then we may be doomed unless we coordinate well, whether or not we solve AI control.
Here is a very crude quantitative model, to make it clear what I am talking about.
Let P1 be the probability of coordinating before the development of AI that would be catastrophic without AI control, and let P2 be the probability of coordinating before the next destructive technology / killer philosophical hurdle after that.
If there are no destructive technologies or philosophical hurdles, then the value of solving AI control is (1 - P1). If there are destructive technologies or philosophical hurdles, then the value of solving AI control is (P2 - P1). I am arguing that (P2 - P1) >= 0.5 * (1 - P1).
This comes down to the claim that P(get house in order after AI but before catastrophe | not gotten house in order prior to AI) is at least 1⁄2.
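(To spell out the arithmetic behind that equivalence, here is a toy calculation; the particular values of P1 and of the conditional probability are purely illustrative, not estimates.)

```python
# P1 = P(coordinate before AI that would be catastrophic without AI control)
# P2 = P(coordinate before the *next* hurdle after that), so P2 >= P1.
# P2 - P1 = P(coordinate in the window between AI and the next hurdle)
#         = P(coordinate in that window | not coordinated before AI) * (1 - P1),
# so (P2 - P1) >= 0.5 * (1 - P1) exactly when that conditional probability
# is at least 1/2.

P1 = 0.3                   # illustrative only
p_window_given_late = 0.6  # illustrative only
P2 = P1 + p_window_given_late * (1 - P1)

value_no_hurdles = 1 - P1      # value of solving AI control, no later hurdles
value_with_hurdles = P2 - P1   # value of solving AI control, with later hurdles
print(value_with_hurdles >= 0.5 * value_no_hurdles)  # True iff conditional >= 1/2
```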
If we both believe this claim, then it seems like the disagreement between us about philosophy could at best account for a factor of 2 difference in our estimates of how valuable AI control research is (where value is measured in terms of “fraction of the universe”—if we measure value in terms of dollars, your argument could potentially decrease our value significantly, since it might suggest that other interventions could do more good and hence dollars are more valuable in terms of “fraction of the universe”).
Realistically it would account for much less though, since we can both agree that there are likely to be destructive technologies, and so all we are really doing is adjusting the timing of the next hurdle that requires coordination.
Suppose A, B, and C each control 1⁄3 of the universe. If A and B join forces they can destroy C and take C’s resources; otherwise it’s a stalemate. (To make the problem easier, assume C can’t join with anyone else.) Another one: A and B each have secure rights, but they need to join together to maximize negentropy.
I’m not sure it’s worth arguing about this. I think that (1) these examples do only a little to increase my expectation of losses from an insufficiently sophisticated understanding of bargaining (I’m happy to argue about it if it ends up being important), but (2) it seems like the main difference is that I am looking for arguments that particular problems are costly enough that it is worthwhile to work on them, while you are looking for an argument that there won’t be any costly problems. (This is related to the discussion above.)
Unlike destructive technologies, philosophical hurdles are only a problem for aligned AIs. With destructive technologies, both aligned and unaligned AIs (at least the ones who don’t terminally value destruction) would want to coordinate to prevent them, and they only have to figure out how. But with philosophical problems, unaligned AIs instead want to exploit them to gain advantages over aligned AIs. For example, if aligned AIs have to spend a lot of time thinking about how to merge or self-improve safely (due to deferring to slow humans), unaligned AIs won’t want to join some kind of global pact to all wait for the humans to decide, but will instead move forward amongst themselves as quickly as they can. This seems like a crucial disanalogy between destructive technologies and philosophical hurdles.
This comes down to the claim that P(get house in order after AI but before catastrophe | not gotten house in order prior to AI) is at least 1⁄2.
This seems really high. In your Medium article you only argued that (paraphrasing) AI could be as helpful for improving coordination as for creating destructive technology. I don’t see how you get from that to this conclusion.
Unaligned AIs don’t necessarily have efficient idealized values. Waiting for (simulated) humans to decide is analogous to computing a complicated pivotal fact about unaligned AI’s values. It’s not clear that “naturally occurring” unaligned AIs have simpler idealized/extrapolated values than aligned AIs with upload-based value definitions. Some unaligned AIs may actually be on the losing side, recall the encrypted-values AI example.