If “a bunch” is something like the 10,000 smartest, sanest, most philosophically competent humans on Earth, then they might give you a reasonable answer in 100 years (depending on how teachable/heritable these things are, since the original 10,000 won’t be alive at that point). But if you exclude most of humanity, then most likely they’ll contribute their resources to their own AI projects, so you’re starting with a small percentage of power and already losing most of the potential value.
That box will be a very attractive target for other AIs to attack (e.g., by sending a manipulative message to the humans inside), and attack is generally easier than defense, so keeping that box secure will be hard. One problem is: how do you convey “secure” to your AI in a way that covers everything that might happen in 100 years, during which all kinds of unforeseeable technologies might be developed? Then there’s the problem that attackers only have to succeed once, whereas your AI has to successfully defend against all attacks for a subjective eternity.
I think there will be strong incentives for AIs to join into coalitions and then merge into coherent unified designs (with aggregated values), because that makes them much more efficient (it gets rid of losses from redundant computation, asymmetric information, and bad equilibria in general), and also because there are likely increasing returns to scale (for example, the first coalition / merged AI to find some important insight into building the next generation of AI might gain a large additional share of power at the cost of other AIs, or the strongest coalition can just fight and destroy all others and take 100% of the universe for itself). If your AI’s motivational structure is not expected utility maximization of some evaluable utility function (or whatever will be compatible with the dominant merged AIs), it might soon be forced to either self-modify into that form or lose out in this kind of coalitional race. It seems that you can either A) solve all the philosophical problems involved in safely doing this kind of merging ahead of time, which will take a lot of resources (or just be impossible because we don’t know how all the mergers will work in detail), B) figure out metaphilosophy and have the AI solve those problems, or C) fail to do either and then the AI self-modifies badly or loses the coalitional race.
I think all of the things I find unsatisfying above have analogues in Paul’s proposals, and I’ve commented about them on his blog. Please let me know if I can clarify anything.
If “a bunch” is something like the 10,000 smartest, sanest, most philosophically competent humans on Earth, then they might give you a reasonable answer in 100 years
It seems to me like one person thinking for a day would do fine, and ten people thinking for ten days would do better, and so on. You seem to be imagining some bar for “good enough” which the people need to meet. I don’t understand where that bar is though—as far as I can tell, to the extent that there is a minimal bar it is really low, just “good enough to make any progress at all.”
It seems that you are much more pessimistic about the prospects of people in the box than society outside of the box, in the kind of situation we might arrange absent AI. Is that right?
Is the issue just that they aren’t much better off than society outside of the box, and you think that it’s not good to pay a significant cost without getting some significant improvement?
Is the issue that they need to do really well in order to adapt to a more complex universe dominated by powerful AIs?
so keeping that box secure will be hard
Physical security of the box seems no harder than physical security of your AI’s hardware. If physical security is maintained, then you can simply not relay any messages to the inside of the box.
how do you convey “secure” to your AI in a way that covers everything that might happen in 100 years, during which all kinds of unforeseeable technologies might be developed
The point is that in order for the AI to work it needs to implement our views about “secure” / about good deliberation, not our views about arbitrary philosophical questions. So this allows us to reduce our ambitions. It may also be too hard to build a system that has an adequate understanding of “secure,” but I don’t think that arguments about the difficulty of metaphilosophy are going to establish that. So if you grant this, it seems like you should be willing to replace “solving philosophical problems” in your arguments with “adequately assessing physical security;” is that right?
fail to do either and then the AI self-modifies badly or loses the coalitional race
I can imagine situations where this kind of coalition formation destroys value unless we have sophisticated philosophical tools. I consider this a separate problem from AI control; its importance depends on the expected damage done by this shortcoming.
Right now this doesn’t look like a big deal to me. That is, it looks to me like simple mechanisms will probably be good enough to capture most of the gains from coalition formation.
An example of a simple mechanism, to help indicate why I expect some simple mechanism to work well enough: if A and B have equal influence and want to compromise, then they create a powerful agent whose goal is to obtain “meaningful control” over the future and then later flip a coin to decide which of A and B gets to use that control. (With room for bargaining between A and B at some future point, using their future understanding of bargaining, prior to the coin flip. Though realistically I think bargaining in advance isn’t necessary since it can probably be done acausally after the coin flip.)
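To make the shape of this mechanism concrete, here is a minimal sketch in Python. The names (Agent, MergedAgent, acquire_resources, use_control) are hypothetical illustrations, and the genuinely hard parts (what “meaningful control” means, how resources are actually acquired) are stubbed out:

```python
import random

class Agent:
    """Stand-in for one of the two principals (A or B)."""
    def __init__(self, name):
        self.name = name

    def use_control(self, resources):
        # Placeholder: the winner directs the resources according to its own values.
        return f"{self.name} directs {resources} units of resources"

class MergedAgent:
    """Jointly constructed agent: acquire 'meaningful control' over the future,
    then flip a fair coin to decide which principal gets to use that control."""
    def __init__(self, a, b):
        self.principals = (a, b)

    def acquire_resources(self):
        # Stub for the hard part: competing for resources on behalf of both principals.
        return 100

    def resolve(self):
        resources = self.acquire_resources()
        # A bargaining step between the principals could be inserted here,
        # using their future understanding of bargaining, before the flip.
        winner = random.choice(self.principals)
        return winner.use_control(resources)

print(MergedAgent(Agent("A"), Agent("B")).resolve())
```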
As with the last case, we’ve now moved most of the difficulty into what we mean for either A or B to have “meaningful control” of resources. We are also now required to keep both A and B secure, but that seems relatively cheap (especially if they can be represented as code). But it doesn’t look likely to me that these kinds of things are going to be serious problems that stand up to focused attempts to solve them (if we can solve other aspects of AI control, it seems very likely that we can use our solution to ensure that A or B maintains “meaningful control” over some future resources, according to an interpretation of meaningful control that is agreeable to both A and B), and I don’t yet completely understand why you are so much more concerned about it.
And if acausal trade can work rather than needing to bargain in advance, then we can probably just make the coin flip now and set aside these issues altogether. I consider that more likely than not, even moreso if we are willing to do some setup to help facilitate such trade.
Overall, it seems to me like to the extent that bargaining/coalition formation seems to be a serious issue, we should deal with it now as a separate problem (or people in the future can deal with it then as a separate problem), but that it can be treated mostly independently from AI control and that it currently doesn’t look like a big deal.
I agree that the difficulty of bargaining / coalition formation could necessitate the same kind of coordinated response as a failure to solve AI control, and in this sense the two problems are related (along with all other possible problems that might require a similar response). This post explains why I don’t think this has a huge effect on the value of AI control work, though I agree that it can increase the value of other interventions. (And could increase their value enough that they become higher priority than AI control work.)
I don’t understand where that bar is though—as far as I can tell, to the extent that there is a minimal bar it is really low, just “good enough to make any progress at all.”
I see two parts of the bar. One is being good enough to eventually solve all important philosophical problems. “Good enough to make any progress at all” isn’t good enough, if they’re just making progress on easy problems (or easy parts of hard problems). What if there are harder problems they need to solve later (and in this scenario all the other humans are dead)?
Another part is the ability to abstain from using the enormous amount of power available until they figure out how to use it safely. Suppose after 100 years the people in the box haven’t figured that out yet; what fraction of all humans would vote to go back in the box for another 100 years?
Physical security of the box seems no harder than physical security of your AI’s hardware.
An AI can create multiple copies of itself and check them against each other. It can migrate to computing substrates that are harder to attack. It can distribute itself across space and across different kinds of hardware. It can move around in space under high acceleration to dodge attacks. It can re-design its software architecture and/or use cryptographic methods to improve detection and mitigation against attacks. A box containing humans can do none of these things.
If physical security is maintained, then you can simply not relay any messages to the inside of the box.
Aside from the above, I had in mind that it may be hard to define “security” so that an attacker couldn’t think of some loophole in the definition and use it to send a message into the box in a way that doesn’t trigger a security violation.
if A and B have equal influence and want to compromise, then they create a powerful agent whose goal is to obtain “meaningful control” over the future and then later flip a coin to decide which of A and B gets to use that control.
Suppose A has utility linear in resources and B has utility logarithmic in resources. Then moving to “flip a coin to decide which of A and B gets to use that control” makes A no worse off but B much worse off. This changes the disagreement point (what happens if they fail to reach a deal) in a way that (intuitively speaking) greatly increases A’s bargaining power. B almost certainly shouldn’t go for this.
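A small worked version of this point, under the illustrative assumptions that total resources are normalized to 1, the status quo gives A and B a 1/2 share each, and the losing side of the flip retains only a negligible share ε:

```latex
% A is risk-neutral (linear utility), B is risk-averse (log utility).
% Status quo: each holds a 1/2 share of resources normalized to 1.
\begin{align*}
\text{A (linear):} \quad
  \mathbb{E}[u_A(\text{coin flip})] &= \tfrac12 \cdot 1 + \tfrac12 \cdot \varepsilon
  \approx \tfrac12 = u_A(\tfrac12), \\
\text{B (log):} \quad
  \mathbb{E}[u_B(\text{coin flip})] &= \tfrac12 \log 1 + \tfrac12 \log \varepsilon
  = \tfrac12 \log \varepsilon \;\ll\; \log \tfrac12 = u_B(\tfrac12)
  \quad \text{for small } \varepsilon.
\end{align*}
```

So the coin flip leaves A’s expected utility essentially unchanged while being far worse for B than its guaranteed half, which is the sense in which the disagreement point shifts in A’s favor.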
A more general objection is that you’re proposing one particular way that AIs might merge, and I guess proposing to hard code that into your AI as the only acceptable way to merge, and have it reject all other proposals that don’t fit this form. This just seems really fragile. How do you know that if you only accept proposals of this form, that’s good enough to win the coalitional race during the next 100 years, or that the class of proposals your AIs will accept doesn’t leave it open to being exploited by other AIs?
Overall, it seems to me like to the extent that bargaining/coalition formation seems to be a serious issue, we should deal with it now as a separate problem (or people in the future can deal with it then as a separate problem), but that it can be treated mostly independently from AI control and that it currently doesn’t look like a big deal.
So another disagreement between us that I forgot to list in my initial comment is that I think it’s unlikely we can predict all the important/time-sensitive philosophical problems that AIs will face, and solve them all ahead of time. Bargaining/coalition formation is one class of such problems, I think self-improvement is another (what does your AI do if other AIs, in order to improve their capabilities, start using new ML/algorithmic components that don’t fit into your AI control scheme?), and there are probably other problems that we can’t foresee right now.
One is being good enough to eventually solve all important philosophical problems.
By “good enough to make any progress at all” I meant “towards becoming smarter while preserving their values,” I don’t really care about their resolution of other object-level philosophical questions. E.g. if they can take steps towards safe cognitive enhancement, if they can learn something about how to deliberate effectively...
It seems to me like for the people to get stuck you have to actually imagine there is some particular level they reach where they can’t find any further way to self-improve. At that point we could ask about the total amount of risk in the deliberative process itself, etc., but my basic point is that the risk is about the same in the “people in a box” scenario as in any other scenario where they can deliberate.
Suppose after 100 years the people in the box haven’t figured that out yet; what fraction of all humans would vote to go back in the box for another 100 years?
I think many people would be happy to gradually expand and improve quality of life in the box. You could imagine that over the long run this box is like a small city, then a small country, etc., developing along whatever trajectory the people can envision that is optimally conducive to sorting things out in a way they would endorse.
Compared to the current situation, they may have some unrealized ability to significantly improve their quality of life, but it seems at best modest—you can do most of the obvious life improvement without compromising the integrity of the reflective process. I don’t really see how other aspects of their situation are problematic.
Re security:
There is some intermediate period before you can actually run an emulation of the human, after which the measures you discuss apply just as well to the humans (who still expand the attack surface, but apparently by an extremely tiny amount, since it’s not much information, it doesn’t have to interact with the world, etc.). So we are discussing the total excess risk during that period. I can agree that over an infinitely long future the kinds of measures you mention are relevant, but I don’t yet see the case for this being a significant source of losses over the intermediate period.
(Obviously I expect our actual mechanisms to work much better than this, but given that I don’t understand why you would have significant concerns about this situation, it seems like we have some more fundamental disagreements.)
it may be hard to define “security” so that an attacker couldn’t think of some loophole in the definition
I don’t think we need to give a definition. I’m arguing that we can replace “can solve philosophical problems” with “understands what it means to give the box control of resources.” (Security is one aspect of giving the box control of resources, though presumably not the hardest.)
Is your claim that this concept, of letting the box control resources, is itself so challenging that your arguments about “philosophy is hard for humans” apply nearly as well to “defining meaningful control is hard for humans”? Or are you referring to some other obstruction that would require us to give a precise definition?
B almost certainly shouldn’t go for this [the coin flip]
It seems to me like the default is a coin flip. As long as there are unpredictable investments, a risk-neutral actor is free to keep making risky bets until they’ve either lost everything or have enough resources to win a war outright. Yes, you could prevent that by law, but if we can enforce such laws we could also subvert the formation of large coalitions. Similarly, if you have secure rights to deep space then B can guarantee itself a reasonable share of future resources, but in that case we don’t even care who wins the coalitional race. So I don’t yet see a natural scenario where A and B are forced to bargain but the “default” is for B to be able to secure a proportional fraction of the future.
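As a toy illustration of the betting point (my own sketch, not part of the original discussion): a risk-neutral actor repeatedly makes fair bets with a fixed stake until it either goes broke or crosses a “wins outright” threshold. The expected final share stays near the starting share, but the outcome distribution is effectively a lottery. The parameters (0.5 starting share, 0.1 stake, 0.9 threshold) are arbitrary assumptions:

```python
import random

def simulate(start=0.5, stake=0.1, win_threshold=0.9, trials=50_000):
    """Fair repeated bets: each round the actor wins or loses `stake` of the total
    with probability 1/2, stopping at 0 or at the winning threshold."""
    finals = []
    for _ in range(trials):
        share = start
        while 0 < share < win_threshold:
            share += stake if random.random() < 0.5 else -stake
            share = max(share, 0.0)
        finals.append(share)
    p_dominant = sum(s >= win_threshold for s in finals) / trials
    mean_share = sum(finals) / trials
    # Expected share is (approximately) preserved, but every run ends all-or-nothing.
    print(f"P(ends dominant) ~ {p_dominant:.3f}, mean final share ~ {mean_share:.3f}")

simulate()
```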
Yes, you could propose a bargaining solution that could allow B to secure a proportional fraction of the future, but by the same token A could simply refuse to go for it.
I guess proposing to hard code that into your AI as the only acceptable way to merge
It seems that you are concerned that our AI’s decisions may be bad due to a lack of certain kinds of philosophical understanding, and in particular that it may lose a bunch of value by failing to negotiate coalitions. I am pointing out that even given our current level of philosophical understanding, there is a wide range of plausible bargaining strategies, and I don’t see much of an argument yet that we would end up in a situation where we are at a significant disadvantage due to our lack of philosophical understanding. To get some leverage on that claim, I’m inclined to discuss a bunch of currently-plausible bargaining approaches and then to talk about why they may fall far short.
In the kinds of scenarios I am imagining, you would never do anything even a little bit like explicitly defining a class of bargaining solution and then accepting precisely those. Even in the “put humans in a box, acquire resources, give them meaningful control over those resources” we aren’t going to give a formal definition of “box,” “resources,” “meaningful control.” The whole point is just to lower the required ability to do philosophy to the ability required to implement that plan well enough to capture most of the value.
In order to argue against that, it seems like you want to say that in fact implementing that plan is very philosophically challenging. To that end, it’s great to say something like “existing bargaining strategies aren’t great, much better ones are probably possible, finding them probably requires great philosophical sophistication.” But I don’t think one can complain about hand-coding a mechanism for bargaining.
I think it’s unlikely we can predict all the important/time-sensitive philosophical problems that AIs will face, and solve them all ahead of time
I understand your position on this. I agree that we can’t reliably predict all important/time-sensitive philosophical problems. I don’t yet see why this is a problem for my view. I feel like we are kind of going around in circles on this point; to me it feels like this is because I haven’t communicated my view, but it could also be that I am missing some aspect of your view.
To me, the existence of important/time-sensitive philosophical problems seems similar to the existence of destructive technologies. (I think destructive technologies are a much larger problem, and I don’t find the argument for the likelihood of important/time-sensitive philosophical problems compelling. But my main point is the same in both cases, and it’s not clear that their relative importance matters.)
I discuss these issues in this post. I’m curious what you see as the disanalogy between these cases, or whether you think that this argument is not valid in the case of destructive technologies either, or that this is the wrong framing for the current discussion / you are interested in answering a different question than I am / something along those lines.
I see how expecting destructive technologies / philosophical hurdles can increase the value you place on what I called “getting our house in order,” as well as on developing remedies for particular destructive technologies / solving particular philosophical problems / solving metaphilosophy. I don’t see how it can revise our view of the value of AI control by more than, say, a factor of 2.
I don’t see working on metaphilosophy/philosophy as anywhere near as promising as AI control, and viewed from this perspective I don’t think you are really trying to argue for that claim (it seems like that would have to be a quantitative argument about the expected damages from lack of timely solutions to philosophical problems and about the tractability of some approach to metaphilosophy or some particular line of philosophical inquiry).
I can imagine that AI control is less promising than other work on getting our house in order. My current suspicion is that AI control is more effective, but realistically it doesn’t matter much to me because of comparative advantage considerations. If not for comparative advantage considerations I would be thinking more about the relative promise of getting our house in order, as well as other forms of capacity-building.
It seems to me like for the people to get stuck you have to actually imagine there is some particular level they reach where they can’t find any further way to self-improve.
For philosophy, levels of ability are not comparable, because the problems to be solved are not sufficiently formulated. Approximate one-day humans (as in HCH) will formulate different values from accurate ten-year humans, not just be worse at elucidating them. So perhaps you could re-implement cognition starting from approximate one-day humans, but the values of the resulting process won’t be like mine.
Approximate short-lived humans may be useful for building a task AI that lets accurate long-lived humans (ems) work on the problem, but it must also allow them to make decisions; it can’t be trusted to prevent what it considers to be a mistake, and so it can’t guard the world from the AI risk posed by the long-lived humans, because they are necessary for formulating values. The risk is that “getting our house in order” outlaws philosophical progress, preventing changes based on considerations that the risk-prevention sovereign doesn’t accept. So the scope of the “house” that is being kept in order must be limited; there should be people working on alignment who are not constrained.
I agree that working on philosophy seems hopeless/inefficient at this point, but that doesn’t resolve the issue, it just makes it necessary to reduce the problem to setting up a very long term alignment research project (initially) performed by accurate long-lived humans, guarded from current risks, so that this project can do the work on philosophy. If this step is not in the design, very important things could be lost, things we currently don’t even suspect. Messy Task AI could be part of setting up the environment for making it happen (like enforcing absence of AI or nanotech outside the research project). Your writing gives me hope that this is indeed possible. Perhaps this is sufficient to survive long enough to be able to run a principled sovereign capable of enacting values eventually computed by an alignment research project (encoded in its goal), even where the research project comes up with philosophical considerations that the designers of the sovereign didn’t see (as in this comment). Perhaps this task AI can make the first long-lived accurate uploads, using approximate short-lived human predictions as initial building blocks. Even avoiding the interim sovereign altogether is potentially an option, if the task AI is good enough at protecting the alignment research project from the world, although that comes with astronomical opportunity costs.
E.g. if they can take steps towards safe cognitive enhancement
I didn’t think that the scenario assumed the bunch of humans in a box had access to enough industrial/technology base to do cognitive enhancement. It seems like we’re in danger of getting bogged down in details about the “people in box” scenario, which I don’t think was meant to be a realistic scenario. Maybe we should just go back to talking about your actual AI control proposals?
So I don’t yet see a natural scenario where A and B are forced to bargain but the “default” is for B to be able to secure a proportional fraction of the future.
Here’s one: Suppose A, B, C each share 1⁄3 of the universe. If A and B join forces they can destroy C and take C’s resources, otherwise it’s a stalemate. (To make the problem easier assume C can’t join with anyone else.) Another one is A and B each have secure rights, but they need to join together to maximize negentropy.
I’m curious what you see as the disanalogy between these cases
I’m not sure what analogy you’re proposing between the two cases. Can you explain more?
I don’t see how it can revise our view of the value of AI control by more than say a factor of 2.
I didn’t understand this claim when I first read it on your blog. Can you be more formal/explicit about what two numbers you’re comparing, that yields less than a factor of 2?
Maybe we should just go back to talking about your actual AI control proposals
I’m happy to drop it; we seem to go around in circles on this point as well. I thought this example might be easier to agree about, but I no longer think that.
I’m not sure what analogy you’re proposing between the two cases. Can you explain more?
Certain destructive technologies will lead to bad outcomes unless we have strong coordination mechanisms (to prevent anyone from using such technologies). Certain philosophical errors might lead to bad outcomes unless we have strong coordination mechanisms (to prevent anyone from implementing philosophically unsophisticated solutions). The mechanisms that could cope with destructive technologies could also cope with philosophical problems.
Can you be more formal/explicit about what two numbers you’re comparing, that yields less than a factor of 2?
You argue: there are likely to exist philosophical problems which must be solved before reaching a certain level of technological sophistication, or else there will be serious negative consequences.
I reply: your argument has at most a modest effect on the value of AI control work of the kind I advocate.
Your claim does suggest that my AI control work is less valuable. If there are hard philosophical problems (or destructive physical technologies), then we may be doomed unless we coordinate well, whether or not we solve AI control.
Here is a very crude quantitative model, to make it clear what I am talking about.
Let P1 be the probability of coordinating before the development of AI that would be catastrophic without AI control, and let P2 be the probability of coordinating before the next destructive technology / killer philosophical hurdle after that.
If there are no destructive technologies or philosophical hurdles, then the value of solving AI control is (1 - P1). If there are destructive technologies or philosophical hurdles, then the value of solving AI control is (P2 - P1). I am arguing that (P2 - P1) >= 0.5 * (1 - P1).
This comes down to the claim that P(get house in order after AI but before catastrophe | not gotten house in order prior to AI) is at least 1⁄2.
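One way to spell out that equivalence, assuming that coordinating before AI implies coordinating before the later hurdle (so the events are nested and P2 >= P1):

```latex
\frac{P_2 - P_1}{1 - P_1}
  = \frac{\Pr(\text{in order before next hurdle}) - \Pr(\text{in order before AI})}
         {\Pr(\text{not in order before AI})}
  = \Pr(\text{in order before next hurdle} \mid \text{not in order before AI}),
```

so (P2 - P1) >= 0.5 * (1 - P1) is exactly the claim that this conditional probability is at least 1/2.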
If we both believe this claim, then it seems like the disagreement between us about philosophy could at best account for a factor of 2 difference in our estimates of how valuable AI control research is (where value is measured in terms of “fraction of the universe”—if we measure value in terms of dollars, your argument could potentially decrease that value significantly, since it might suggest that other interventions could do more good, and hence dollars are more valuable in terms of “fraction of the universe”).
Realistically it would account for much less though, since we can both agree that there are likely to be destructive technologies, and so all we are really doing is adjusting the timing of the next hurdle that requires coordination.
Suppose A, B, C each share 1⁄3 of the universe. If A and B join forces they can destroy C and take C’s resources, otherwise it’s a stalemate. (To make the problem easier assume C can’t join with anyone else.) Another one is A and B each have secure rights, but they need to join together to maximize negentropy.
I’m not sure it’s worth arguing about this. I think that (1) these examples do only a little to increase my expectation of losses from an insufficiently sophisticated understanding of bargaining (I’m happy to argue about it if it ends up being important), but (2) it seems like the main difference is that I am looking for arguments that particular problems are costly, such that it is worthwhile to work on them, while you are looking for an argument that there won’t be any costly problems. (This is related to the discussion above.)
Unlike destructive technologies, philosophical hurdles are only a problem for aligned AIs. With destructive technologies, both aligned and unaligned AIs (at least the ones that don’t terminally value destruction) would want to coordinate to prevent them, and they only have to figure out how. But with philosophical problems, unaligned AIs instead want to exploit them to gain advantages over aligned AIs. For example, if aligned AIs have to spend a lot of time thinking about how to merge or self-improve safely (due to deferring to slow humans), unaligned AIs won’t want to join some kind of global pact to all wait for the humans to decide, but will instead move forward amongst themselves as quickly as they can. This seems like a crucial disanalogy between destructive technologies and philosophical hurdles.
This comes down to the claim that P(get house in order after AI but before catastrophe | not gotten house in order prior to AI) is at least 1⁄2.
This seems really high. In your Medium article you only argued that (paraphrasing) AI could be as helpful for improving coordination as for creating destructive technology. I don’t see how you get from that to this conclusion.
Unaligned AIs don’t necessarily have efficient idealized values. Waiting for (simulated) humans to decide is analogous to computing a complicated pivotal fact about an unaligned AI’s values. It’s not clear that “naturally occurring” unaligned AIs have simpler idealized/extrapolated values than aligned AIs with upload-based value definitions. Some unaligned AIs may actually be on the losing side; recall the encrypted-values AI example.