This seems a good opportunity for me to summarize my disagreements with both Paul and MIRI. In short, there are two axes along which Paul and MIRI disagree with each other, where I’m more pessimistic than either of them.
(One of Paul’s latest replies to me on his AI control blog says “I have become more pessimistic after thinking it through somewhat more carefully.” and “If that doesn’t look good (and it probably won’t) I will have to step back and think about the situation more broadly.” I’m currently not sure how broadly Paul was going to rethink the situation or what conclusions he has since reached. What follows is meant to reflect my understanding of his positions up to those statements.)
One axis might be called “metaphilosophical paternalism” (a phrase I just invented, not sure if there’s an existing one I should use), i.e., how much external support / error correction do humans need, or more generally how benign an environment do we have to place them in, before we can expect them to eventually figure out their “actual” values (which implies correctly solving all relevant philosophical dependencies such as population ethics and philosophy of consciousness), and how hard will it be to design and provide such support / error correction.
MIRI’s position seems to be that humans do need a lot of external support / error correction (see CEV) and this is a hard problem, but not so hard that it will likely turn out to be a blocking issue. Paul’s position went from his 2012 version of “indirect normativity”, which envisioned placing a human in a relatively benign simulated environment (although still very different from the kinds of environments in which we have historical evidence of humans being able to make philosophical progress), to his current ideas where humans live in very hostile environments, having to process potentially adversarial messages from superintelligent AIs under time pressure.
My own thinking is that we currently know very little about metaphilosophy, essentially nothing beyond that philosophy is some kind of computational / cognitive process implemented by (at least some) human brains, and there seems to be such a thing as philosophical truth or philosophical progress, but that is hard to define or even recognize. Without easy ways to check one’s ideas (e.g., using controlled experiments or mathematical proofs), human cognitive processes tend to diverge rather than converge. (See political and religious beliefs, for example.) If typical of philosophical problems in general, understanding metaphilosophy well enough to implement something like CEV will likely take many decades of work even after someone discovers a viable approach (which we don’t yet have). Think of how confused we still are about how expected utility maximization applies in bargaining, or what priors really are or should be, many decades after those ideas were first proposed. I don’t understand Paul and MIRI’s reasons for being as optimistic as they each are on this issue.
The other axis of disagreement is how feasible it would be to create aligned AI that matches or beats unaligned AI in efficiency/capability. Here Paul is only trying to match unaligned AIs using the same mainstream AI techniques, whereas MIRI is trying to beat unaligned AIs in order to prevent them from undergoing intelligence explosion. But even Paul is more optimistic than I think is warranted. (To be fair, at least some within MIRI, such as Nate, may be aiming to beat unaligned AIs not because they’re particularly optimistic about the prospects of doing so, but because they’re pessimistic about what would happen if we merely match them.) It seems unlikely to me that alignment to complex human values comes for free. If nothing else, aligned AIs will be more complex than unaligned AIs and such complexity is costly in design, coding, maintenance, and security. Think of the security implications of having a human controller or a complex value extrapolation process at an AI’s core, compared to something simpler like a paperclip maximizer, or the continuous challenges of creating improved revisions of AI design while minimizing the risk of losing alignment to a set of complex and unknown values.
Jessica’s post lists searching for fundamental obstructions to aligned AI as a motivation for Paul’s research direction. I think given that efficient aligned AIs almost certainly exist as points in mindspace, it’s unlikely that we can find “fundamental” reasons why we can’t build them. Instead they will likely just take much more resources (including time) to build than unaligned AIs, for a host of “messy” reasons. Maybe the research can show that certain approaches to building competitive aligned AIs won’t succeed, but realistically such a result can only hope to cover a tiny part of AI design space, so I don’t see why that kind of result would be particularly valuable.
Please note that what I wrote here isn’t meant to be an argument against doing the kind of research that Paul and MIRI are doing. It’s more of an argument for simultaneously trying to find and pursue other approaches to solving the “AI risk” problem, especially ones that don’t require the same preconditions in order to succeed. Otherwise, since those preconditions don’t seem very likely to actually obtain, we’re leaving huge amounts of potential expected value on the table if we bank on just one or even both of these approaches.
Weighing in late here, I’ll briefly note that my current stance on the difficulty of philosophical issues is (in colloquial terms) “for the love of all that is good, please don’t attempt to implement CEV with your first transhuman intelligence”. My strategy at this point is very much “build the minimum AI system that is capable of stabilizing the overall strategic situation, and then buy a whole lot of time, and then use that time to figure out what to do with the future.” I might be more optimistic than you about how easy it will turn out to be to find a reasonable method for extrapolating human volition, but I suspect that that’s a moot point either way, because regardless, thou shalt not attempt to implement CEV with humanity’s very first transhuman intelligence.
Also, +1 to the overall point of “also pursue other approaches”.
MIRI’s position seems to be that humans do need a lot of external support / error correction (see CEV) and this is a hard problem, but not so hard that it will likely turn out to be a blocking issue.
Note that Eliezer is currently more optimistic about task AGI than CEV (for the first AGI built), and I think Nate is too. I’m not sure what Benya thinks.
Oh, right, I had noticed that, and then forgot and went back to my previous model of MIRI. I don’t think Eliezer ever wrote down why he changed his mind about task AGI or how he is planning to use one. If the plan is something like “buy enough time to work on CEV at leisure”, then possibly I have much less disagreement on “metaphilosophical paternalism” with MIRI than I thought.
If typical of philosophical problems in general, understanding metaphilosophy well enough to implement something like CEV will likely take many decades of work even after someone discovers a viable approach (which we don’t yet have).
Consider the following strategy the AI could take:
1. Put a bunch of humans in a secure box containing food/housing/etc.
2. Acquire as much power as possible while keeping the box intact.
3. After 100 years, ask the humans in the box what to do next.
There are lots of things that are unsatisfying about the proposal (e.g. the fact that only the humans in the box survive), but I’m curious which you find least satisfying (especially unsatisfying things that are also unsatisfying about Paul’s proposals). Do you think designing this AI will require solving metaphilosophical problems? Do you think this AI will be at a substantial efficiency disadvantage relative to a paperclip maximizer?
(Note that this doesn’t require humans to figure out their actual values in 100 years; they can decide some questions and kick the rest to another 100 years later)
If “a bunch” is something like the 10000 smartest, most sane, most philosophically competent humans on Earth, then they might give you a reasonable answer in 100 years (depending on how teachable/heritable these things are, since the original 10000 won’t be alive at that point). But if you exclude most of humanity then most likely they’ll contribute their resources to their own AI projects, so you’re starting with a small percentage of power, and already losing most of the potential value.
That box will be a very attractive target for other AIs to attack (e.g., by sending a manipulative message to the humans inside). Attack is generally easier than defense, so keeping that box secure will be hard. One problem is how do you convey “secure” to your AI in a way that covers everything that might happen in 100 years, during which all kinds of unforeseeable technologies might be developed? Then there’s the problem that attackers only have to succeed once whereas your AI has to successfully defend against all attacks for a subjective eternity.
I think there will be strong incentives for AIs to join into coalitions and then merge into coherent unified designs (with aggregated values) because that makes them much more efficient (it gets rid of losses from redundant computations, asymmetric information, and bad equilibria in general), and also because there are likely increasing returns to scale (for example the first coalition / merged AI to find some important insight into building the next generation of AI might gain a large additional share of power at the cost of other AIs, or the strongest coalition can just fight and destroy all others and take 100% of the universe for itself). If your AI’s motivational structure is not expected utility maximization of some evaluable utility function (or whatever will be compatible with the dominant merged AIs), it might soon be forced to either self-modify into that form or lose out in this kind of coalitional race. It seems that you can either A) solve all the philosophical problems involved in safely doing this kind of merging ahead of time, which will take a lot of resources (or just be impossible because we don’t know how all the mergers will work in detail), B) figure out metaphilosophy and have the AI solve those problems, or C) fail to do either and then the AI self-modifies badly or loses the coalitional race.
I think all of the things I find unsatisfying above have analogues in Paul’s proposals, and I’ve commented about them on his blog. Please let me know if I can clarify anything.
If “a bunch” is something like the 10000 smartest, most sane, most philosophically competent humans on Earth, then they might give you a reasonable answer in 100 years
It seems to me like one person thinking for a day would do fine, and ten people thinking for ten days would do better, and so on. You seem to be imagining some bar for “good enough” which the people need to meet. I don’t understand where that bar is though—as far as I can tell, to the extent that there is a minimal bar it is really low, just “good enough to make any progress at all.”
It seems that you are much more pessimistic about the prospects of people in the box than society outside of the box, in the kind of situation we might arrange absent AI. Is that right?
Is the issue just that they aren’t much better off than society outside of the box, and you think that it’s not good to pay a significant cost without getting some significant improvement?
Is the issue that they need to do really well in order to adapt to a more complex universe dominated by powerful AIs?
so keeping that box secure will be hard
Physical security of the box seems no harder than physical security of your AI’s hardware. If physical security is maintained, then you can simply not relay any messages to the inside of the box.
how do you convey “secure” to your AI in a way that covers everything that might happen in 100 years, during which all kinds of unforeseeable technologies might be developed
The point is that in order for the AI to work it needs to implement our views about “secure” / about good deliberation, not our views about arbitrary philosophical questions. So this allows us to reduce our ambitions. It may also be too hard to build a system that has an adequate understanding of “secure,” but I don’t think that arguments about the difficulty of metaphilosophy are going to establish that. So if you grant this, it seems like you should be willing to replace “solving philosophical problems” in your arguments with “adequately assessing physical security;” is that right?
fail to do either and then the AI self-modifies badly or loses the coalitional race
I can imagine situations where this kind of coalitional formation destroys value unless we have sophisticated philosophical tools. I consider this a separate problem from AI control; its importance depends on the expected damage done by this shortcoming.
Right now this doesn’t look like a big deal to me. That is, it looks to me like simple mechanisms will probably be good enough to capture most of the gains from coalition formation.
An example of a simple mechanism, to help indicate why I expect some simple mechanism to work well enough: if A and B have equal influence and want to compromise, then they create a powerful agent whose goal is to obtain “meaningful control” over the future and then later flip a coin to decide which of A and B gets to use that control. (With room for bargaining between A and B at some future point, using their future understanding of bargaining, prior to the coin flip. Though realistically I think bargaining in advance isn’t necessary since it can probably be done acausally after the coin flip.)
As with the last case, we’ve now moved most of the difficulty into what we mean for either A or B to have “meaningful control” of resources. We are also now required to keep both A and B secure, but that seems relatively cheap (especially if they can be represented as code). But it doesn’t look likely to me that these kinds of things are going to be serious problems that stand up to focused attempts to solve them (if we can solve other aspects of AI control, it seems very likely that we can use our solution to ensure that A or B maintains “meaningful control” over some future resources, according to an interpretation of meaningful control that is agreeable to both A and B), and I don’t yet completely understand why you are so much more concerned about it.
And if acausal trade can work rather than needing to bargain in advance, then we can probably just make the coin flip now and set aside these issues altogether. I consider that more likely than not, even more so if we are willing to do some setup to help facilitate such trade.
Overall, it seems to me like to the extent that bargaining/coalition formation seems to be a serious issue, we should deal with it now as a separate problem (or people in the future can deal with it then as a separate problem), but that it can be treated mostly independently from AI control and that it currently doesn’t look like a big deal.
I agree that the difficulty of bargaining / coalition formation could necessitate the same kind of coordinated response as a failure to solve AI control, and in this sense the two problems are related (along with all other possible problems that might require a similar response). This post explains why I don’t think this has a huge effect on the value of AI control work, though I agree that it can increase the value of other interventions. (And could increase their value enough that they become higher priority than AI control work.)
I don’t understand where that bar is though—as far as I can tell, to the extent that there is a minimal bar it is really low, just “good enough to make any progress at all.”
I see two parts of the bar. One is being good enough to eventually solve all important philosophical problems. “Good enough to make any progress at all” isn’t good enough, if they’re just making progress on easy problems (or easy parts of hard problems). What if there are harder problems they need to solve later (and in this scenario all the other humans are dead)?
Another part is the ability to abstain from using the enormous amount of power available until they figure out how to use it safely. Suppose after 100 years, the people in the box haven’t figured that out yet, what fraction of all humans would vote to go back in the box for another 100 years?
Physical security of the box seems no harder than physical security of your AI’s hardware.
An AI can create multiple copies of itself and check them against each other. It can migrate to computing substrates that are harder to attack. It can distribute itself across space and across different kinds of hardware. It can move around in space under high acceleration to dodge attacks. It can re-design its software architecture and/or use cryptographic methods to improve detection and mitigation against attacks. A box containing humans can do none of these things.
If physical security is maintained, then you can simply not relay any messages to the inside of the box.
Aside from the above, I had in mind that it may be hard to define “security” so that an attacker couldn’t think of some loophole in the definition and use it to send a message into the box in a way that doesn’t trigger a security violation.
if A and B have equal influence and want to compromise, then they create a powerful agent whose goal is to obtain “meaningful control” over the future and then later flip a coin to decide which of A and B gets to use that control.
Suppose A has utility linear in resources and B has utility logarithmic in resources, then moving to “flip a coin to decide which of A and B gets to use that control” makes A no worse off but B much worse off. This changes the disagreement point (what happens if they fail to reach a deal), in a way that (intuitively speaking) greatly increases A’s bargaining power. B almost certainly shouldn’t go for this.
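A rough numerical sketch of the point above, under assumptions that are mine rather than part of the original argument: total resources are normalized to 1, the status quo is an even split, and the loser of the coin flip keeps a tiny residual share `FLOOR` (a hypothetical value, needed only because log utility is undefined at zero).

```python
import math

# Hypothetical residual share kept by the loser of the coin flip.
# Log utility is undefined at zero, so some positive floor is needed.
FLOOR = 1e-6


def expected_utility(utility, lottery):
    """Expected utility of a lottery given as (probability, resources) pairs."""
    return sum(p * utility(r) for p, r in lottery)


def linear(resources):
    """A's utility: linear in resources."""
    return resources


def logarithmic(resources):
    """B's utility: logarithmic in resources."""
    return math.log(resources)


# Status quo (assumed): A and B each hold half of the resources for sure.
even_split = [(1.0, 0.5)]
# Proposed deal: a fair coin decides who controls (almost) everything.
coin_flip = [(0.5, 1.0 - FLOOR), (0.5, FLOOR)]

a_split = expected_utility(linear, even_split)
a_flip = expected_utility(linear, coin_flip)
b_split = expected_utility(logarithmic, even_split)
b_flip = expected_utility(logarithmic, coin_flip)

print(f"A (linear): split={a_split:.3f}, flip={a_flip:.3f}")
print(f"B (log):    split={b_split:.3f}, flip={b_flip:.3f}")
```

Under these assumptions A is indifferent between the two options, while B’s expected utility falls sharply, and it falls further as `FLOOR` shrinks.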
A more general objection is that you’re proposing one particular way that AIs might merge, and I guess proposing to hard code that into your AI as the only acceptable way to merge, and have it reject all other proposals that don’t fit this form. This just seems really fragile. How do you know that if you only accept proposals of this form, that’s good enough to win the coalitional race during the next 100 years, or that the class of proposals your AIs will accept doesn’t leave it open to being exploited by other AIs?
Overall, it seems to me like to the extent that bargaining/coalition formation seems to be a serious issue, we should deal with it now as a separate problem (or people in the future can deal with it then as a separate problem), but that it can be treated mostly independently from AI control and that it currently doesn’t look like a big deal.
So another disagreement between us that I forgot to list in my initial comment is that I think it’s unlikely we can predict all the important/time-sensitive philosophical problems that AIs will face, and solve them all ahead of time. Bargaining/coalition formation is one class of such problems, I think self-improvement is another (what does your AI do if other AIs, in order to improve their capabilities, start using new ML/algorithmic components that don’t fit into your AI control scheme?), and there are probably other problems that we can’t foresee right now.
One is being good enough to eventually solve all important philosophical problems.
By “good enough to make any progress at all” I meant “towards becoming smarter while preserving their values”; I don’t really care about their resolution of other object-level philosophical questions. E.g. if they can take steps towards safe cognitive enhancement, if they can learn something about how to deliberate effectively...
It seems to me like for the people to get stuck you have to actually imagine there is some particular level they reach where they can’t find any further way to self-improve. At that point we could ask about the total amount of risk in the deliberative process itself, etc., but my basic point is that the risk is about the same in the “people in a box” scenario as in any other scenario where they can deliberate.
Suppose after 100 years, the people in the box haven’t figured that out yet, what fraction of all humans would vote to go back in the box for another 100 years?
I think many people would be happy to gradually expand and improve quality of life in the box. You could imagine that over the long run this box is like a small city, then a small country, etc., developing along whatever trajectory the people can envision that is optimally conducive to sorting things out in a way they would endorse.
Compared to the current situation, they may have some unrealized ability to significantly improve their quality of life, but it seems at best modest—you can do most of the obvious life improvement without compromising the integrity of the reflective process. I don’t really see how other aspects of their situation are problematic.
Re security:
There is some intermediate period before you can actually run an emulation of the human, after which the measures you discuss apply just as well to the humans (which still expand the attack surface, but apparently by an extremely tiny amount since it’s not much information, it doesn’t have to interact with the world, etc.). So we are discussing the total excess risk during that period. I can agree that over an infinitely long future the kinds of measures you mention are relevant, but I don’t yet see the case for this being a significant source of losses over the intermediate period.
(Obviously I expect our actual mechanisms to work much better than this, but given that I don’t understand why you would have significant concerns about this situation, it seems like we have some more fundamental disagreements.)
it may be hard to define “security” so that an attacker couldn’t think of some loophole in the definition
I don’t think we need to give a definition. I’m arguing that we can replace “can solve philosophical problems” with “understands what it means to give the box control of resources.” (Security is one aspect of giving the box control of resources, though presumably not the hardest.)
Is your claim that this concept, of letting the box control resources, is itself so challenging that your arguments about “philosophy is hard for humans” apply nearly as well to “defining meaningful control is hard for humans”? Are you referring to some other obstruction that would require us to give a precise definition?
B almost certainly shouldn’t go for this [the coin flip]
It seems to me like the default is a coin flip. As long as there are unpredictable investments, a risk-neutral actor is free to keep making risky bets until they’ve either lost everything or have enough resources to win a war outright. Yes, you could prevent that by law, but if we can enforce such laws we could also subvert the formation of large coalitions. Similarly, if you have secure rights to deep space then B can guarantee itself a reasonable share of future resources, but in that case we don’t even care who wins the coalitional race. So I don’t yet see a natural scenario where A and B are forced to bargain but the “default” is for B to be able to secure a proportional fraction of the future.
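To illustrate why repeated risky bets amount to a coin flip, here is a small simulation sketch. The setup is an assumption made for illustration only: every bet is a fair, even-money wager of one unit, and the numbers (5 units held, 10 needed to win outright) are hypothetical. For fair bets of this kind, the standard gambler’s ruin result says the chance of reaching the target before losing everything is start / target.

```python
import random


def chance_of_reaching(start, target, trials=4000, seed=0):
    """Estimate how often a bettor starting with `start` units reaches
    `target` before hitting zero, staking one unit on a fair coin each round."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        wealth = start
        # Keep betting until ruined or until the target is reached.
        while 0 < wealth < target:
            wealth += 1 if rng.random() < 0.5 else -1
        if wealth >= target:
            wins += 1
    return wins / trials


# Hypothetical numbers: the actor holds 5 units and needs 10 to win outright.
estimate = chance_of_reaching(start=5, target=10)
print(f"estimated chance of reaching the target: {estimate:.3f}")
print(f"start / target: {5 / 10:.3f}")
```

The expected amount of resources is unchanged by fair bets, so a risk-neutral actor loses nothing in expectation by converting its share into an all-or-nothing gamble.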
Yes, you could propose a bargaining solution that could allow B to secure a proportional fraction of the future, but by the same token A could simply refuse to go for it.
I guess proposing to hard code that into your AI as the only acceptable way to merge
It seems that you are concerned that our AI’s decisions may be bad due to a lack of certain kinds of philosophical understanding, and in particular that it may lose a bunch of value by failing to negotiate coalitions. I am pointing out that even given our current level of philosophical understanding, there is a wide range of plausible bargaining strategies, and I don’t see much of an argument yet that we would end up in a situation where we are at a significant disadvantage due to our lack of philosophical understanding. To get some leverage on that claim, I’m inclined to discuss a bunch of currently-plausible bargaining approaches and then to talk about why they may fall far short.
In the kinds of scenarios I am imagining, you would never do anything even a little bit like explicitly defining a class of bargaining solution and then accepting precisely those. Even in the “put humans in a box, acquire resources, give them meaningful control over those resources” we aren’t going to give a formal definition of “box,” “resources,” “meaningful control.” The whole point is just to lower the required ability to do philosophy to the ability required to implement that plan well enough to capture most of the value.
In order to argue against that, it seems like you want to say that in fact implementing that plan is very philosophically challenging. To that end, it’s great to say something like “existing bargaining strategies aren’t great, much better ones are probably possible, finding them probably requires great philosophical sophistication.” But I don’t think one can complain about hand-coding a mechanism for bargaining.
I think it’s unlikely we can predict all the important/time-sensitive philosophical problems that AIs will face, and solve them all ahead of time
I understand your position on this. I agree that we can’t reliably predict all important/time-sensitive philosophical problems. I don’t yet see why this is a problem for my view. I feel like we are kind of going around in circles on this point; to me it feels like this is because I haven’t communicated my view, but it could also be that I am missing some aspect of your view.
To me, the existence of important/time-sensitive philosophical problems seems similar to the existence of destructive technologies. (I think destructive technologies are a much larger problem and I don’t find the argument for the likelihood of important/time-sensitive philosophy problems compelling. But my main point is the same in both cases and it’s not clear that their relative importance matters.)
I discuss these issues in this post. I’m curious what you see as the disanalogy between these cases, or whether you think that this argument is not valid in the case of destructive technologies either, or think that this is the wrong framing for the current discussion / you are interested in answering a different question than I am / something along those lines.
I see how expecting destructive technologies / philosophical hurdles can increase the value you place on what I called “getting our house in order,” as well as on developing remedies for particular destructive technologies / solving particular philosophical problems / solving metaphilosophy. I don’t see how it can revise our view of the value of AI control by more than say a factor of 2.
I don’t see working on metaphilosophy/philosophy as anywhere near as promising as AI control, and again, viewed from this perspective, I don’t think you are really trying to argue for that claim (it seems like that would have to be a quantitative argument about the expected damages from lack of timely solutions to philosophical problems and about the tractability of some approach to metaphilosophy or some particular line of philosophical inquiry).
I can imagine that AI control is less promising than other work on getting our house in order. My current suspicion is that AI control is more effective, but realistically it doesn’t matter much to me because of comparative advantage considerations. If not for comparative advantage considerations I would be thinking more about the relative promise of getting our house in order, as well as other forms of capacity-building.
It seems to me like for the people to get stuck you have to actually imagine there is some particular level they reach where they can’t find any further way to self-improve.
For philosophy, levels of ability are not comparable, because problems to be solved are not sufficiently formulated. Approximate one-day humans (as in HCH) will formulate different values from accurate ten-years humans, not just be worse at elucidating them. So perhaps you could re-implement cognition starting from approximate one-day humans, but values of the resulting process won’t be like mine.
Approximate short-lived humans may be useful for building a task AI that lets accurate long-lived humans (ems) work on the problem, but it must also allow them to make decisions; it can’t be trusted to prevent what it considers to be a mistake, and so it can’t guard the world from AI risk posed by the long-lived humans, because they are necessary for formulating values. The risk is that “getting our house in order” outlaws philosophical progress and prevents changing things based on considerations that the risk-prevention sovereign doesn’t accept. So the scope of the “house” that is being kept in order must be limited; there should be people working on alignment who are not constrained.
I agree that working on philosophy seems hopeless/inefficient at this point, but that doesn’t resolve the issue, it just makes it necessary to reduce the problem to setting up a very long term alignment research project (initially) performed by accurate long-lived humans, guarded from current risks, so that this project can do the work on philosophy. If this step is not in the design, very important things could be lost, things we currently don’t even suspect. Messy Task AI could be part of setting up the environment for making it happen (like enforcing absence of AI or nanotech outside the research project). Your writing gives me hope that this is indeed possible. Perhaps this is sufficient to survive long enough to be able to run a principled sovereign capable of enacting values eventually computed by an alignment research project (encoded in its goal), even where the research project comes up with philosophical considerations that the designers of the sovereign didn’t see (as in this comment). Perhaps this task AI can make the first long-lived accurate uploads, using approximate short-lived human predictions as initial building blocks. Even avoiding the interim sovereign altogether is potentially an option, if the task AI is good enough at protecting the alignment research project from the world, although that comes with astronomical opportunity costs.
E.g. if they can take steps towards safe cognitive enhancement
I didn’t think that the scenario assumed the bunch of humans in a box had access to enough industrial/technology base to do cognitive enhancement. It seems like we’re in danger of getting bogged down in details about the “people in box” scenario, which I don’t think was meant to be a realistic scenario. Maybe we should just go back to talking about your actual AI control proposals?
So I don’t yet see a natural scenario where A and B are forced to bargain but the “default” is for B to be able to secure a proportional fraction of the future.
Here’s one: Suppose A, B, and C each hold 1/3 of the universe. If A and B join forces they can destroy C and take C’s resources, otherwise it’s a stalemate. (To make the problem easier assume C can’t join with anyone else.) Another one is that A and B each have secure rights, but they need to join together to maximize negentropy.
I’m curious what you see as the disanalogy between these cases
I’m not sure what analogy you’re proposing between the two cases. Can you explain more?
I don’t see how it can revise our view of the value of AI control by more than say a factor of 2.
I didn’t understand this claim when I first read it on your blog. Can you be more formal/explicit about what two numbers you’re comparing, that yields less than a factor of 2?
Maybe we should just go back to talking about your actual AI control proposals
I’m happy to drop it; we seem to go around in circles on this point as well. I thought this example might be easier to agree about, but I no longer think that.
I’m not sure what analogy you’re proposing between the two cases. Can you explain more?
Certain destructive technologies will lead to bad outcomes unless we have strong coordination mechanisms (to prevent anyone from using such technologies). Certain philosophical errors might lead to bad outcomes unless we have strong coordination mechanisms (to prevent anyone from implementing philosophically unsophisticated solutions). The mechanisms that could cope with destructive technologies could also cope with philosophical problems.
Can you be more formal/explicit about what two numbers you’re comparing, that yields less than a factor of 2?
You argue: there are likely to exist philosophical problems which must be solved before reaching a certain level of technological sophistication, or else there will be serious negative consequences.
I reply: your argument has at most a modest effect on the value of AI control work of the kind I advocate.
Your claim does suggest that my AI control work is less valuable. If there are hard philosophical problems (or destructive physical technologies), then we may be doomed unless we coordinate well, whether or not we solve AI control.
Here is a very crude quantitative model, to make it clear what I am talking about.
Let P1 be the probability of coordinating before the development of AI that would be catastrophic without AI control, and let P2 be the probability of coordinating before the next destructive technology / killer philosophical hurdle after that.
If there are no destructive technologies or philosophical hurdles, then the value of solving AI control is (1 - P1). If there are destructive technologies or philosophical hurdles, then the value of solving AI control is (P2 - P1). I am arguing that (P2 - P1) >= 0.5 * (1 - P1).
This comes down to the claim that P(get house in order after AI but before catastrophe | not gotten house in order prior to AI) is at least 1⁄2.
If we both believe this claim, then it seems like the disagreement between us about philosophy could at best account for a factor of 2 difference in our estimates of how valuable AI control research is (where value is measured in terms of “fraction of the universe”—if we measure value in terms of dollars, your argument could potentially decrease our value significantly, since it might suggest that other interventions could do more good and hence dollars are more valuable in terms of “fraction of the universe”).
Realistically it would account for much less though, since we can both agree that there are likely to be destructive technologies, and so all we are really doing is adjusting the timing of the next hurdle that requires coordination.
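To make the arithmetic in the crude model above concrete, here is a minimal Python sketch. The function names and the sample probabilities are invented for illustration and are not part of the original argument. The sketch also assumes that coordinating before AI counts as coordinating before the later hurdle (so P1 <= P2), which is what lets the ratio (P2 - P1) / (1 - P1) be read as the conditional probability mentioned above.

```python
import math


def value_without_hurdles(p1: float) -> float:
    """Value of solving AI control if no later hurdle exists: 1 - P1."""
    return 1.0 - p1


def value_with_hurdles(p1: float, p2: float) -> float:
    """Value of solving AI control if a later hurdle exists: P2 - P1."""
    # Assumption of this sketch: coordinating before AI also counts as
    # coordinating before the later hurdle, so P1 <= P2.
    if not 0.0 <= p1 <= p2 <= 1.0:
        raise ValueError("expected 0 <= P1 <= P2 <= 1")
    return p2 - p1


def conditional_coordination(p1: float, p2: float) -> float:
    """P(coordinate before the next hurdle | no coordination before AI)."""
    if p1 >= 1.0:
        raise ValueError("P1 must be below 1 for the conditional to be defined")
    return value_with_hurdles(p1, p2) / value_without_hurdles(p1)


def discount_factor(p1: float, p2: float) -> float:
    """How many times less valuable AI control is once hurdles are assumed."""
    ratio = conditional_coordination(p1, p2)
    return math.inf if ratio == 0.0 else 1.0 / ratio


if __name__ == "__main__":
    p1, p2 = 0.25, 0.75  # made-up illustrative probabilities
    print(conditional_coordination(p1, p2))
    print(discount_factor(p1, p2))
```

With these made-up inputs the conditional probability is about 0.67 and the discount factor is about 1.5. Whenever the conditional probability is at least 1/2, the discount factor stays at or below 2, which is the bound being claimed.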
Suppose A, B, C each share 1⁄3 of the universe. If A and B join forces they can destroy C and take C’s resources, otherwise it’s a stalemate. (To make the problem easier assume C can’t join with anyone else.) Another one is A and B each have secure rights, but they need to join together to maximize negentropy.
I’m not sure it’s worth arguing about this. I think that (1) these examples do only a little to increase my expectation of losses from an insufficiently sophisticated understanding of bargaining (I’m happy to argue about it if it ends up being important), but (2) it seems like the main difference is that I am looking for arguments that particular problems are costly such that it is worthwhile to work on them, while you are looking for an argument that there won’t be any costly problems. (This is related to the discussion above.)
Unlike destructive technologies, philosophical hurdles are only a problem for aligned AIs. With destructive technologies, both aligned and unaligned AIs (at least the ones who don’t terminally value destruction) would want to coordinate to prevent them and they only have to figure out how. But with philosophical problems, unaligned AIs instead want to exploit them to gain advantages over aligned AIs. For example if aligned AIs have to spend a lot of time to think about how to merge or self-improve safely (due to deferring to slow humans), unaligned AIs won’t want to join some kind of global pact to all wait for the humans to decide, but will instead move forward amongst themselves as quickly as they can. This seems like a crucial disanalogy between destructive technologies and philosophical hurdles.
This comes down to the claim that P(get house in order after AI but before catastrophe | not gotten house in order prior to AI) is at least 1⁄2.
This seems really high. In your Medium article you only argued that (paraphrasing) AI could be as helpful for improving coordination as for creating destructive technology. I don’t see how you get from that to this conclusion.
Unaligned AIs don’t necessarily have efficient idealized values. Waiting for (simulated) humans to decide is analogous to computing a complicated pivotal fact about unaligned AI’s values. It’s not clear that “naturally occurring” unaligned AIs have simpler idealized/extrapolated values than aligned AIs with upload-based value definitions. Some unaligned AIs may actually be on the losing side, recall the encrypted-values AI example.
Speaking for myself, the main issue is that we have no idea how to do step 3, how to tell a pre-existing sovereign what to do. A task AI with limited scope can be replaced, but an optimizer has to be able to understand what is being asked of it, and if it wasn’t designed to be able to understand certain things, it won’t be possible to direct it correctly. If in 100 years the humans come up with new principles in how the AI should make decisions (philosophical progress), it may be impossible to express these principles as directions for an existing AI that was designed without the benefit of understanding these principles.
(Of course, the humans shouldn’t be physically there, or it will be too hard to say what it means to keep them safe, but making accurate uploads and packaging the 100 years as a pure computation solves this issue without any conceptual difficulty.)
A task AI with limited scope can be replaced, but an optimizer has to be able to understand what is being asked of it, and if it wasn’t designed to be able to understand certain things, it won’t be possible to direct it correctly.
It’s not clear to me why “limited scope” and “can be replaced” are related. An agent with broad scope can still be optimizing something like “what the human would want me to do today” and the human could have preferences like “now that humans believe that an alternative design would have been better, gracefully step aside.” (And an agent with narrow scope could be unwilling to step aside if so doing would interfere with accomplishing its narrow task.)
Being able to “gracefully step aside” (to be replaced) is an example of what I meant by “limited scope” (in time). Even if AI’s scope is “broad”, the crucial point is that it’s not literally everything (and by default it is). In practice it shouldn’t be more than a small part of the future, so that the rest can be optimized better, using new insights. (Also, to be able to ask what humans would want today, there should remain some humans who didn’t get “optimized” into something else.)
I was talking specifically about algorithms that build a model of a human and then optimize over that model in order to do useful algorithmic work (e.g. modeling human translation quality and then choosing the optimal translation).
i.e., how much external support / error correction do humans need, or more generally how benign of an environment do we have to place them in, before we can expect them to eventually figure out their “actual” values
I still don’t get your position on this point, but we seem to be going around a bit in circles. Probably the most useful thing would be responding to Jessica’s hypothetical about putting humanity in a box.
searching for fundamental obstructions to aligned AI
I am not just looking for an aligned AI, I am looking for a transformation from (arbitrary AI) --> (aligned AI). I think the thing-to-understand is what kind of algorithms can plausibly give you AI systems without being repurposable to being aligned (with O(1) work). I think this is the kind of problem for which you are either going to get a positive or negative answer. That is, probably there is either such a transformation, or a hard-to-align design and an argument for why we can’t align it.
(This example with optimizing over human translations seems like it could well be an insurmountable obstruction, implying that my most ambitious goal is impossible.)
I don’t understand Paul and MIRI’s reasons for being as optimistic as they each are on this issue.
I believe that both of us think that what you perceive as a problem can be sidestepped (I think this is the same issue we are going in circles around).
It seems unlikely to me that alignment to complex human values comes for free.
The hope is to do a sublinear amount of additional work, not to get it for free.
It’s more of an argument for simultaneously trying to find and pursue other approaches to solving the “AI risk” problem, especially ones that don’t require the same preconditions in order to succeed
It seems like we are roughly on the same page; but I am more optimistic about either discovering a positive answer or a negative answer, and so I think this approach is the highest-leveraged thing to work on and you don’t.
I think that cognitive or institutional enhancement is also a contender, as is getting our house in order, even if our only goal is dealing with AI risk.
I still don’t get your position on this point, but we seem to be going around a bit in circles.
Yes, my comment was more targeted to other people, who I’m hoping can provide their own views on these issues. (It’s kind of strange that more people haven’t commented on your ideas online. I’ve asked to be invited to any future MIRI workshops discussing them, in case most of the discussions are happening offline.)
I am not just looking for an aligned AI, I am looking for a transformation from (arbitrary AI) --> (aligned AI). I think the thing-to-understand is what kind of algorithms can plausibly give you AI systems without being repurposable to being aligned (with O(1) work).
Can you be more explicit and formal about what you’re looking for? Is it a transformation T, such that for any AI A, T(A) is an aligned AI as efficient as A, and applying T amounts to O(1) of work? (O(1) relative to what variable? The work that originally went into building A?)
If that’s what you mean, then it seems obvious that T doesn’t exist, but I don’t know how else to interpret your statement.
That is, probably there is either such a transformation, or a hard-to-align design and an argument for why we can’t align it.
I don’t understand why this disjunction is true, which might be because of my confusion above. Also, even if you found a hard-to-align design and an argument for why we can’t align it, that doesn’t show that aligned AIs can’t be competitive with unaligned AIs (in order to convince others to coordinate, as Jessica wrote). The people who need convincing will just think there’s almost certainly other ways to build a competitive aligned AI that doesn’t involve transforming the hard-to-align design.
Can you be more explicit and formal about what you’re looking for?
Consider some particular research program that might yield powerful AI systems, e.g. (search for better model classes for deep learning, search for improved optimization algorithms, deploy these algorithms on increasingly large hardware). For each such research program I would like to have some general recipe that takes as input the intermediate products of that program (i.e. the hardware and infrastructure, the model class, the optimization algorithms) and uses them to produce a benign AI which is competitive with the output of the research program. The additional effort and cost required for this recipe would ideally be sublinear in the effort invested in the underlying research project (O(1) was too strong).
I suspect this is possible for some research programs and not others. I expect there are some programs for which this goal is demonstrably hopeless. I think that those research programs need to be treated with care. Moreover, if I had a demonstration that a research program was dangerous in this way, I expect that I could convince people that it needs to be treated with care.
The people who need convincing will just think there’s almost certainly other ways to build a competitive aligned AI that doesn’t involve transforming the hard-to-align design.
Yes, at best someone might agree that a particular research program is dangerous/problematic. That seems like enough though—hopefully they could either be convinced to pursue other research programs that aren’t problematic, or would continue with the problematic research program and could then agree that other measures are needed to avert the risk.
If an AI causes its human controller to converge to false philosophical conclusions (especially ones relevant to their values), either directly through its own actions or indirectly by allowing adversarial messages to reach the human from other AIs, it clearly can’t be considered benign. But given our current lack of metaphilosophical understanding, how do you hope to show that any particular AI (e.g., the output of a proposed transformation/recipe) won’t cause that? Or is the plan to accept a lower burden of proof, namely assume that the AI is benign as long as no one can show that it does cause its human controller to converge to false philosophical conclusions?
The additional effort and cost required for this recipe would ideally be sublinear in the effort invested in the underlying research project (O(1) was too strong).
If such a recipe existed for a project, that doesn’t mean it’s not problematic. For example, suppose there are projects A and B, each of which had such a recipe. It’s still possible that using intermediate products of both A and B, one could build a more efficient unaligned AI but not a competitive aligned AI (at sublinear additional cost). Similarly, if recipes didn’t exist for projects A and B individually, one might still exist for A+B. It seems like to make a meaningful statement you have to treat the entire world as one big research program. Do you agree?
If such a recipe existed for a project, that doesn’t mean it’s not problematic. For example, suppose there are projects A and B, each of which had such a recipe. It’s still possible that using intermediate products of both A and B, one could build a more efficient unaligned AI but not a competitive aligned AI
My hope is to get technique A working, then get technique B working, and then get A+B working, and so on, prioritizing based on empirical guesses about what combinations will end up being deployed in practice (and hoping to develop general understanding that can be applied across many programs and combinations). I expect that in many cases, if you can handle A and B you can handle A+B, though some interactions will certainly introduce new problems. This program doesn’t have much chance of success without new abstractions.
indirectly by allowing adversarial messages to reach the human from other AIs, it clearly can’t be considered benign
This is definitely benign on my accounting. There is a further question of how well you do in conflict. A window is benign but won’t protect you from inputs that will drive you crazy. The hope is that if you have an AI that is benign + powerful then you may be OK.
directly through its own actions
If the agent is trying to implement deliberation in accordance with the user’s preferences about deliberation, then I want to call that benign. There is a further question of whether we mess up deliberation, which could happen with or without AI. We would like to set things up in such a way that we aren’t forced to deliberate earlier than we would otherwise want to. (And this is included in the user’s preferences about deliberation, i.e. a benign AI will be trying to secure for the user the option of deliberating later, if the user believes that deliberating later is better than deliberating in concert with the AI now.)
Malign just means “actively optimizing for something bad,” the hope is to avoid that, but this doesn’t rule out other kinds of problems (e.g. causing deliberation to go badly due to insufficient competence, blowing up the world due to insufficient competence, etc.)
Overall, my current best guess is that this disagreement is better to pursue after my research program is further along, we know things like whether “benign” makes sense as an abstraction, I have considered some cases where benign agents necessarily seem to be less efficient, and so on.
I am still interested in arguments that might (a) convince me to not work on this program, e.g. because I should be working on alternative social solutions, (b) convince others to work on this program, e.g. because they currently don’t see how it could succeed but might work on it if they did, or (c) clarify the key obstructions for this research program.
I really agree with #2 (and I think with #1, as well, but I’m not as sure I understand your point there).
I’ve been trying to convince people that there will be strong trade-offs between safety and performance, and have been surprised that this doesn’t seem obvious to most… but I haven’t really considered that “efficient aligned AIs almost certainly exist as points in mindspace”. In fact I’m not sure I agree 100% (basically because of “Moloch” (http://slatestarcodex.com/2014/07/30/meditations-on-moloch/)).
I think “trying to find and pursue other approaches to solving the “AI risk” problem, especially ones that don’t require the same preconditions in order to succeed” remains perhaps the most important thing to do; do you have anything in particular in mind? Personally, I tend to think that we ought to address the coordination problem head-on and attempt a solution before AGI really “takes off”.
I’ve been trying to convince people that there will be strong trade-offs between safety and performance
What do you see as the best arguments for this claim? I haven’t seen much public argument for it and am definitely interested in seeing more. I definitely grant that it’s prima facie plausible (as is the alternative).
Some caveats:
It’s obvious there are trade-offs between safety and performance in the usual sense of “safety.” But we are interested in a special kind of failure, where a failed system ends up controlling a significant share of the entire universe’s resources (rather than e.g. causing an explosion), and it’s less obvious that preventing such failures necessarily requires a significant cost.
It’s also obvious that there is an additional cost to be paid in order to solve control, e.g. consider the fact that we are currently spending time on it. But the question is how much additional work needs to be done. Does building aligned systems require 1000% more work? 10%? 0.1%? I don’t see why it should be obvious that this number is on the order of 100% rather than 1%.
Similarly for performance costs. I’m willing to grant that an aligned system will be more expensive to run. But is that cost an extra 1000% or an extra 0.1%? Both seem quite plausible. From a theoretical perspective the question is whether the required overhead is linear or sublinear.
I haven’t seen strong arguments for the “linear overhead” side, and my current guess is that the answer is sublinear. But again, both positions seem quite plausible.
(There are currently a few major obstructions to my approach that could plausibly give a tight theoretical argument for linear overhead, such as the translation example in the discussion with Wei Dai. In the past such obstructions have ended up seeming surmountable, but I think that it is totally plausible that eventually one won’t. And at that point I hope to be able to make clean statements about exactly what kind of thing we can’t hope to do efficiently+safely / exactly what kinds of additional assumptions we would have to make / what the key obstructions are).
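The linear-versus-sublinear question above can be illustrated with a small Python sketch. Everything in it is hypothetical: the cost functions, the constants 0.1 and 10, and the choice of a square root as the sublinear example are assumptions made only to show the shape of the distinction, not estimates of real alignment costs.

```python
import math
from typing import Callable


def overhead_fraction(base_effort: float,
                      overhead: Callable[[float], float]) -> float:
    """Extra alignment cost as a fraction of the underlying effort."""
    if base_effort <= 0:
        raise ValueError("base_effort must be positive")
    return overhead(base_effort) / base_effort


def linear_overhead(n: float) -> float:
    """Hypothetical linear overhead: always 10% of the base effort."""
    return 0.1 * n


def sublinear_overhead(n: float) -> float:
    """Hypothetical sublinear overhead: grows like the square root of n."""
    return 10.0 * math.sqrt(n)


if __name__ == "__main__":
    for n in (1e2, 1e4, 1e6):
        print(n,
              overhead_fraction(n, linear_overhead),
              overhead_fraction(n, sublinear_overhead))
```

Under these made-up functions, the linear overhead stays at a fixed 10% of the base effort at every scale, while the sublinear overhead starts out larger but shrinks toward zero as a fraction of the base effort as the underlying project grows.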
Personally, I tend to think that we ought to address the coordination problem head-on
I think this is a good idea and a good project, which I would really like to see more people working on. In the past I may have seemed more dismissive and if so I apologize for being misguided. I’ve spent a little bit of time thinking about it recently and my feeling is that there is a lot of productive and promising work to do.
My current guess is that AI control is the more valuable thing for me personally to do though I could imagine being convinced out of this.
I feel that AI control is valuable given that (a) it has a reasonable chance of succeeding even if we can’t solve these coordination problems, and (b) convincing evidence that the problem is hard would be a useful input into getting the AI community to coordinate.
If you managed to get AI researchers to effectively coordinate around conditionally restricting access to AI (if it proved to be dangerous), then that would seriously undermine argument (b). I believe that a sufficiently persuasive/charismatic/accomplished person could probably do this today.
If I ended up becoming convinced that AI control was impossible this would undermine argument (a) (though hopefully that impossibility argument could itself be used to satisfy desideratum (b)).
To be fair, at least some within MIRI, such as Nate, may be aiming to beat unaligned AIs not because they’re particularly optimistic about the prospects of doing so, but because they’re pessimistic about what would happen if we merely match them.
My model of Nate thinks the path to victory goes through the aligned AI project gaining a substantial first mover advantage (through fast local takeoff, more principled algorithms, and/or better coordination). Though he’s also quite concerned about extremely large efficiency disadvantages of aligned AI vs unaligned AI (e.g. he’s pessimistic about act-based agents helping much because they might require the AI to be good at predicting humans doing complex tasks such as research).
Jessica’s post lists searching for fundamental obstructions to aligned AI as a motivation for Paul’s research direction. I think given that efficient aligned AIs almost certainly exist as points in mindspace, it’s unlikely that we can find “fundamental” reasons why we can’t build them. Instead they will likely just take much more resources (including time) to build than unaligned AIs, for a host of “messy” reasons.
In this case I expect that in <10 years we get something like: “we tried making aligned versions of a bunch of algorithms, but the aligned versions are always less powerful because they left out some source of power the unaligned versions had. We iterated the process a few times (studying the additional sources of power and making aligned versions of them), and this continued to be the case. We have good reasons to believe that there isn’t a sensible stopping point to this process.” This seems pretty close to a fundamental obstruction and it seems like it would be similarly useful, especially if the “good reasons to believe there isn’t a sensible stopping point to this process” tell us something new about which relaxations are promising.
I don’t see this as being the case. As Vadim pointed out, we don’t even know what we mean by “aligned versions” of algos, ATM. So we wouldn’t know if we’re succeeding or failing (until it’s too late and we have a treacherous turn).
“Orthogonality implies that alignment shouldn’t cost performance, but says nothing about the costs of ‘value loading’ (i.e. teaching an AI human values and verifying its value learning procedure and/or the values it has learned). Furthermore, value loading will probably be costly, because we don’t know how to do it, competitive dynamics make the opportunity cost of working on it large, and we don’t even have clear criteria for success.”
As Vadim pointed out, we don’t even know what we mean by “aligned versions” of algos, ATM. So we wouldn’t know if we’re succeeding or failing (until it’s too late and we have a treacherous turn).
Even beyond Jessica’s point (that failure to improve our understanding would constitute an observable failure), I don’t completely buy this.
We are talking about AI safety because there are reasons to think that AI systems will cause a historically unprecedented kind of problem. If we could design systems for which we had no reason to expect them to cause such problems, then we could rest easy.
I don’t think there is some kind of magical and unassailable reason to be suspicious of powerful AI systems, there are just a bunch of particular reasons to be concerned.
Similarly, there is no magical reason to expect a treacherous turn—this is one of the kinds of unusual failures which we have reason to be concerned about. If we built a system for which we had no reason to be concerned, then we shouldn’t be concerned.
I think the core of our differences is that I see minimally constrained, opaque, utility-maximizing agents with good models of the world and access to rich interfaces (sensors and actuators) as extremely likely to be substantially more powerful than what we will be able to build if we start degrading any of these properties.
These properties also seem sufficient for a treacherous turn (in an unaligned AI).
I see minimally constrained, opaque, utility-maximizing agents with good models of the world and access to rich interfaces (sensors and actuators) as extremely likely to be substantially more powerful than what we will be able to build if we start degrading any of these properties.
The only point on which there is plausible disagreement is “utility-maximizing agents.” On a narrow reading of “utility-maximizing agents” it is not clear why it would be important to getting more powerful performance.
On a broad reading of “utility-maximizing agents” I agree that powerful systems are utility-maximizing. But if we take a broad reading of this property, I don’t agree with the claim that we will be unable to reliably tell that such agents aren’t dangerous without theoretical progress.
In particular, there is an argument of the form “the prospect of a treacherous turn makes any informal analysis unreliable.” I agree that the prospect of a treacherous turn makes some kinds of informal analysis unreliable. But I think it is completely wrong that it makes all informal analysis unreliable, I think that appropriate informal analysis can be sufficient to rule out the prospect of a treacherous turn. (Most likely an analysis that keeps track of what is being optimized, and rules out the prospect that an indicator was competently optimized to manipulate our understanding of the current situation.)
Paul, I’m not sure I understand what you’re saying here. Can you imagine a system “for which we had no reason to expect it to cause such problems” without an underlying mathematical theory that shows why this system is safe?
The reason AI systems will cause a historically unprecedented kind of problem is that AI systems can outsmart humans and thus create situations that are outside our control, even when we don’t a priori see the precise mechanism by which we will lose control. In order for such a system to be safe, we need to know that it will not attempt anything detrimental to us, and we need to know this as an abstraction, i.e. without knowing in detail what the system will do (because the system is superintelligent, so by definition we cannot guess its actions).
Doesn’t it seem improbable to you that we will have a way of having such knowledge by some other means than the accuracy of mathematical thought?
That is, we can have a situation like “AI running in homomorphic encryption with a quantum-generated key that is somewhere far from the AI’s computer” where it’s reasonable to claim that the AI is safe as long as it stays encrypted (even though there is still some risk from being wrong about cryptographic conjectures or the AI exploiting some surprising sort of unknown physics), without using a theory of intelligence at all (beyond the fact that intelligence is a special case of computation), but it seems unlikely that we can have something like this while simultaneously having the AI powerful enough to protect us against other AIs that are malicious.
Can you imagine a system “for which we had no reason to expect it to cause such problems” without an underlying mathematical theory that shows why this system is safe?
Yes. For example, suppose we built a system whose behavior was only expected to be intelligent to the extent that it imitated intelligent human behavior—for which there is no other reason to believe that it is intelligent. Depending on the human being imitated, such a system could end up seeming unproblematic even without any new theoretical understanding.
We don’t yet see any way to build such a system, much less to do so in a way that could be competitive with the best RL system that could be designed at a given level of technology. But I can certainly imagine it.
(Obviously I think there is a much larger class of systems that might be non-problematic, though it may depend on what we mean by “underlying mathematical theory.”)
AI systems can outsmart humans and thus create situations that are outside our control, even when we don’t a priori see the precise mechanism by which we will lose control
This doesn’t seem sufficient for trouble. Trouble only occurs when those systems are effectively optimizing for some inhuman goals, including e.g. acquiring and protecting resources.
That is a very special thing for a system to do, above and beyond being able to accomplish tasks that apparently require intelligence. Currently we don’t have any way to accomplish the goals of AI that doesn’t risk this failure mode, but it’s not obvious that it is necessary.
Can you imagine a system “for which we had no reason to expect it to cause such problems” without an underlying mathematical theory that shows why this system is safe?
...suppose we built a system whose behavior was only expected to be intelligent to the extent that it imitated intelligent human behavior—for which there is no other reason to believe that it is intelligent.
This doesn’t seem to be a valid example: your system is not superintelligent, it is “merely” human. That is, I can imagine solving AI risk by building whole brain emulations with enormous speed-up and using them to acquire absolute power. However:
I think this is not what is usually meant by “solving AI alignment.”
The more you use heuristic learning algorithms instead of “classical” brain emulation the more I would be worried your algorithm does something subtly wrong in a way that distorts values, although that would also invalidate the condition that “there is no other reason to believe that it is intelligent.”
There is a high-risk zone here where someone untrustworthy can gain this technology and use it to unwittingly create unfriendly AI.
AI systems can outsmart humans and thus create situations that are outside our control, even when we don’t a priori see the precise mechanism by which we will lose control
This doesn’t seem sufficient for trouble. Trouble only occurs when those systems are effectively optimizing for some inhuman goals, including e.g. acquiring and protecting resources.
Well, any AI is effectively optimizing for some goal by definition. How do you know this goal is “human”? In particular, if your AI is supposed to defend us from other AIs, it is very much in the business of acquiring and protecting resources.
As Vadim pointed out, we don’t even know what we mean by “aligned versions” of algos, ATM. So we wouldn’t know if we’re succeeding or failing
If we fail to make the intuition about aligned versions of algorithms more crisp than it currently is, then it’ll be pretty clear that we failed. It seems reasonable to be skeptical that we can make our intuitions about “aligned versions of algorithms” crisp and then go on to design competitive and provably aligned versions of all AI algorithms in common use. But it does seem like we will know if we succeed at this task, and even before then we’ll have indications of progress such as success/failure at formalizing and solving scalable AI control in successively complex toy environments. (It seems like I have intuitions about what would constitute progress that are hard to convey over text, so I would not be surprised if you aren’t convinced that it’s possible to measure progress).
“Orthogonality implies that alignment shouldn’t cost performance, but says nothing about the costs of ‘value loading’ (i.e. teaching an AI human values and verifying its value learning procedure and/or the values it has learned). Furthermore, value loading will probably be costly, because we don’t know how to do it, competitive dynamics make the opportunity cost of working on it large, and we don’t even have clear criteria for success.”
It seems like “value loading is very hard/costly” has to imply that the proposal in this comment thread is going to be very hard/costly, e.g. because one of Wei Dai’s objections to it proves fatal. But it seems like arguments of the form “human values are complex and hard to formalize” or “humans don’t know what we value” are insufficient to establish this; Wei Dai’s objections in the thread are mostly not about value learning. (sorry if you aren’t arguing “value loading is hard because human values are complex and hard to formalize” and I’m misinterpreting you)
This seems a good opportunity for me to summarize my disagreements with both Paul and MIRI. In short, there are two axes along which Paul and MIRI disagree with each other, where I’m more pessimistic than either of them.
(One of Paul’s latest replies to me on his AI control blog says “I have become more pessimistic after thinking it through somewhat more carefully.” and “If that doesn’t look good (and it probably won’t) I will have to step back and think about the situation more broadly.” I’m currently not sure how broadly Paul was going to rethink the situation or what conclusions he has since reached. What follows is meant to reflect my understanding of his positions up to those statements.)
One axis might be called “metaphilosophical paternalism” (a phrase I just invented, not sure if there’s an existing one I should use), i.e., how much external support / error correction do humans need, or more generally how benign of an environment do we have to place them in, before we can expect them to eventually figure out their “actual” values (which implies correctly solving all relevant philosophical dependencies such as population ethics and philosophy of consciousness) and how hard will it be to design and provide such support / error correction.
MIRI’s position seems to be that humans do need a lot of external support / error correction (see CEV) and this is a hard problem, but not so hard that it will likely turn out to be a blocking issue. Paul’s position went from his 2012 version of “indirect normativity” which envisioned placing a human in a relatively benign simulated environment (although still very different from the kinds of environments where we have historical evidence of humans being able to make philosophical progress in) to his current ideas where humans live in very hostile environments, having to process potentially adversarial messages from superintelligent AIs under time pressure.
My own thinking is that we currently know very little about metaphilosophy, essentially nothing beyond that philosophy is some kind of computational / cognitive process implemented by (at least some) human brains, and there seems to be such a thing as philosophical truth or philosophical progress, but that is hard to define or even recognize. Without easy ways to check one’s ideas (e.g., using controlled experiments or mathematical proofs), human cognitive processes tend to diverge rather than converge. (See political and religious beliefs, for example.) If typical of philosophical problems in general, understanding metaphilosophy well enough to implement something like CEV will likely take many decades of work even after someone discovers a viable approach (which we don’t yet have). Think of how confused we still are about how expected utility maximization applies in bargaining, or what priors really are or should be, many decades after those ideas were first proposed. I don’t understand Paul and MIRI’s reasons for being as optimistic as they each are on this issue.
The other axis of disagreement is how feasible it would be to create aligned AI that matches or beats unaligned AI in efficiency/capability. Here Paul is only trying to match unaligned AIs using the same mainstream AI techniques, whereas MIRI is trying to beat unaligned AIs in order to prevent them from undergoing intelligence explosion. But even Paul is more optimistic than I think is warranted. (To be fair, at least some within MIRI, such as Nate, may be aiming to beat unaligned AIs not because they’re particularly optimistic about the prospects of doing so, but because they’re pessimistic about what would happen if we merely match them.) It seems unlikely to me that alignment to complex human values comes for free. If nothing else, aligned AIs will be more complex than unaligned AIs and such complexity is costly in design, coding, maintenance, and security. Think of the security implications of having a human controller or a complex value extrapolation process at an AI’s core, compared to something simpler like a paperclip maximizer, or the continuous challenges of creating improved revisions of AI design while minimizing the risk of losing alignment to a set of complex and unknown values.
Jessica’s post lists searching for fundamental obstructions to aligned AI as a motivation for Paul’s research direction. I think given that efficient aligned AIs almost certainly exist as points in mindspace, it’s unlikely that we can find “fundamental” reasons why we can’t build them. Instead they will likely just take much more resources (including time) to build than unaligned AIs, for a host of “messy” reasons. Maybe the research can show that certain approaches to building competitive aligned AIs won’t succeed, but realistically such a result can only hope to cover a tiny part of AI design space, so I don’t see why that kind of result would be particularly valuable.
Please note that what I wrote here isn’t meant to be an argument against doing the kind of research that Paul and MIRI are doing. It’s more of an argument for simultaneously trying to find and pursue other approaches to solving the “AI risk” problem, especially ones that don’t require the same preconditions in order to succeed. Otherwise, since those preconditions don’t seem very likely to actually obtain, we’re leaving huge amounts of potential expected value on the table if we bank on just one or even both of these approaches.
Weighing in late here, I’ll briefly note that my current stance on the difficulty of philosophical issues is (in colloquial terms) “for the love of all that is good, please don’t attempt to implement CEV with your first transhuman intelligence”. My strategy at this point is very much “build the minimum AI system that is capable of stabilizing the overall strategic situation, and then buy a whole lot of time, and then use that time to figure out what to do with the future.” I might be more optimistic than you about how easy it will turn out to be to find a reasonable method for extrapolating human volition, but I suspect that that’s a moot point either way, because regardless, thou shalt not attempt to implement CEV with humanity’s very first transhuman intelligence.
Also, +1 to the overall point of “also pursue other approaches”.
Note that Eliezer is currently more optimistic about task AGI than CEV (for the first AGI built), and I think Nate is too. I’m not sure what Benya thinks.
Oh, right, I had noticed that, and then forgot and went back to my previous model of MIRI. I don’t think Eliezer ever wrote down why he changed his mind about task AGI or how he is planning to use one. If the plan is something like “buy enough time to work on CEV at leisure”, then possibly I have much less disagreement on “metaphilosophical paternalism” with MIRI than I thought.
Consider the following strategy the AI could take:
Put a bunch of humans in a secure box containing food/housing/etc
Acquire as much power as possible while keeping the box intact
After 100 years, ask the humans in the box what to do next
There are lots of things that are unsatisfying about the proposal (e.g. the fact that only the humans in the box survive), but I’m curious which you find least satisfying (especially unsatisfying things that are also unsatisfying about Paul’s proposals). Do you think designing this AI will require solving metaphilosophical problems? Do you think this AI will be at a substantial efficiency disadvantage relative to a paperclip maximizer?
(Note that this doesn’t require humans to figure out their actual values in 100 years; they can decide some questions and kick the rest to another 100 years later)
If “a bunch” is something like 10000 smartest, most sane, most philosophically competent humans on Earth, then they might give you a reasonable answer in 100 years (depending on how teachable/heritable these things are, since the original 10000 won’t be alive at that point). But if you exclude most of humanity then most likely they’ll contribute their resources to their own AI projects so you’re starting with a small percent of power, and already losing most of potential value.
That box will be a very attractive target for other AIs to attack (e.g., by sending a manipulative message to the humans inside), attack is generally easier than defense, so keeping that box secure will be hard. One problem is how do you convey “secure” to your AI in a way that covers everything that might happen in 100 years, during which all kinds of unforeseeable technologies might be developed? Then there’s the problem that attackers only have to succeed once whereas your AI has to successfully defend against all attacks for a subjective eternity.
I think there will be strong incentives for AIs to join into coalitions and then merge into coherent unified designs (with aggregated values) because that makes them much more efficient (it gets rid of losses from redundant computations, asymmetric information, and bad equilibria in general), and also because there are likely increasing returns to scale (for example the first coalition / merged AI to find some important insight into building the next generation of AI might gain a large additional share of power at the cost of other AIs, or the strongest coalition can just fight and destroy all others and take 100% of the universe for itself). If your AI’s motivational structure is not expected utility maximization of some evaluable utility function (or whatever will be compatible with the dominant merged AIs), it might soon be forced to either self-modify into that form or lose out in this kind of coalitional race. It seems that you can either A) solve all the philosophical problems involved in safely doing this kind of merging ahead of time, which will take a lot of resources (or just be impossible because we don’t know how all the mergers will work in detail), B) figure out metaphilosophy and have the AI solve those problems, or C) fail to do either and then the AI self-modifies badly or loses the coalitional race.
I think all of the things I find unsatisfying above have analogues in Paul’s proposals, and I’ve commented about them on his blog. Please let me know if I can clarify anything.
It seems to me like one person thinking for a day would do fine, and ten people thinking for ten days would do better, and so on. You seem to be imagining some bar for “good enough” which the people need to meet. I don’t understand where that bar is though—as far as I can tell, to the extent that there is a minimal bar it is really low, just “good enough to make any progress at all.”
It seems that you are much more pessimistic about the prospects of people in the box than society outside of the box, in the kind of situation we might arrange absent AI. Is that right?
Is the issue just that they aren’t much better off than society outside of the box, and you think that it’s not good to pay a significant cost without getting some significant improvement?
Is the issue that they need to do really well in order to adapt to a more complex universe dominated by powerful AIs?
Physical security of the box seems no harder than physical security of your AI’s hardware. If physical security is maintained, then you can simply not relay any messages to the inside of the box.
The point is that in order for the AI to work it needs to implement our views about “secure” / about good deliberation, not our views about arbitrary philosophical questions. So this allows us to reduce our ambitions. It may also be too hard to build a system that has an adequate understanding of “secure,” but I don’t think that arguments about the difficulty of metaphilosophy are going to establish that. So if you grant this, it seems like you should be willing to replace “solving philosophical problems” in your arguments with “adequately assessing physical security;” is that right?
I can imagine situations where this kind of coalitional formation destroys value unless we have sophisticated philosophical tools. I consider this a separate problem from AI control; its importance depends on the expected damage done by this shortcoming.
Right now this doesn’t look like a big deal to me. That is, it looks to me like simple mechanisms will probably be good enough to capture most of the gains from coalition formation.
An example of a simple mechanism, to help indicate why I expect some simple mechanism to work well enough: if A and B have equal influence and want to compromise, then they create a powerful agent whose goal is to obtain “meaningful control” over the future and then later flip a coin to decide which of A and B gets to use that control. (With room for bargaining between A and B at some future point, using their future understanding of bargaining, prior to the coin flip. Though realistically I think bargaining in advance isn’t necessary since it can probably be done acausally after the coin flip.)
As with the last case, we’ve now moved most of the difficulty into what we mean for either A or B to have “meaningful control” of resources. We are also now required to keep both A and B secure, but that seems relatively cheap (especially if they can be represented as code). But it doesn’t look likely to me that these kinds of things are going to be serious problems that stand up to focused attempts to solve them (if we can solve other aspects of AI control, it seems very likely that we can use our solution to ensure that A or B maintains “meaningful control” over some future resources, according to an interpretation of meaningful control that is agreeable to both A and B), and I don’t yet completely understand why you are so much more concerned about it.
And if acausal trade can work rather than needing to bargain in advance, then we can probably just make the coin flip now and set aside these issues altogether. I consider that more likely than not, even moreso if we are willing to do some setup to help facilitate such trade.
Overall, it seems to me like to the extent that bargaining/coalition formation seems to be a serious issue, we should deal with it now as a separate problem (or people in the future can deal with it then as a separate problem), but that it can be treated mostly independently from AI control and that it currently doesn’t look like a big deal.
I agree that the difficulty of bargaining / coalition formation could necessitate the same kind of coordinated response as a failure to solve AI control, and in this sense the two problems are related (along with all other possible problems that might require a similar response). This post explains why I don’t think this has a huge effect on the value of AI control work, though I agree that it can increase the value of other interventions. (And could increase their value enough that they become higher priority than AI control work.)
I see two parts of the bar. One is being good enough to eventually solve all important philosophical problems. “Good enough to make any progress at all” isn’t good enough, if they’re just making progress on easy problems (or easy parts of hard problems). What if there are harder problems they need to solve later (and in this scenario all the other humans are dead)?
Another part is the ability to abstain from using the enormous amount of power available until they figure out how to use it safely. Suppose after 100 years, the people in the box haven’t figured that out yet. What fraction of all humans would vote to go back in the box for another 100 years?
An AI can create multiple copies of itself and check them against each other. It can migrate to computing substrates that are harder to attack. It can distribute itself across space and across different kinds of hardware. It can move around in space under high acceleration to dodge attacks. It can re-design its software architecture and/or use cryptographic methods to improve detection and mitigation against attacks. A box containing humans can do none of these things.
Aside from the above, I had in mind that it may be hard to define “security” so that an attacker couldn’t think of some loophole in the definition and use it to send a message into the box in a way that doesn’t trigger a security violation.
Suppose A has utility linear in resources and B has utility log in resources, then moving to “flip a coin to decide which of A and B gets to use that control” makes A no worse off but B much worse off. This changes the disagreement point (what happens if they fail to reach a deal), in a way that (intuitively speaking) greatly increases A’s bargaining power. B almost certainly shouldn’t go for this.
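To make the asymmetry concrete, here is a minimal sketch in Python. It rests on assumptions that are not in the comment above: total resources are normalized to 1, the alternative to the coin flip is a guaranteed even split, and the loser of the flip keeps a tiny residual share `eps` (a hypothetical parameter, introduced only so that the log utility stays finite).

```python
import math

def linear_utility(share: float) -> float:
    """Agent A: utility is linear in the share of resources held."""
    return share

def log_utility(share: float) -> float:
    """Agent B: utility is logarithmic in the share of resources held."""
    return math.log(share)

def coin_flip_value(utility, eps: float = 1e-6) -> float:
    """Expected utility of a fair coin flip for (almost) all resources.

    The winner gets 1 - eps and the loser keeps eps. The residual eps is an
    illustrative assumption so that log(0) never has to be evaluated.
    """
    return 0.5 * utility(1.0 - eps) + 0.5 * utility(eps)

def even_split_value(utility) -> float:
    """Utility of a guaranteed half of the resources (assumed baseline)."""
    return utility(0.5)

# Change in expected utility from replacing the even split with the coin flip.
a_gain = coin_flip_value(linear_utility) - even_split_value(linear_utility)
b_gain = coin_flip_value(log_utility) - even_split_value(log_utility)

print(f"A (linear) gains {a_gain:.6f}")
print(f"B (log) gains {b_gain:.6f}")
```

Under these assumptions A is indifferent between the two arrangements, while B's expected utility drops sharply, and drops further as `eps` shrinks. The exact numbers depend entirely on the assumed baseline and residual share.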
A more general objection is that you’re proposing one particular way that AIs might merge, and I guess proposing to hard code that into your AI as the only acceptable way to merge, and have it reject all other proposals that don’t fit this form. This just seems really fragile. How do you know that if you only accept proposals of this form, that’s good enough to win the coalitional race during the next 100 years, or that the class of proposals your AIs will accept doesn’t leave it open to being exploited by other AIs?
So another disagreement between us that I forgot to list in my initial comment is that I think it’s unlikely we can predict all the important/time-sensitive philosophical problems that AIs will face, and solve them all ahead of time. Bargaining/coalition formation is one class of such problems, I think self-improvement is another (what does your AI do if other AIs, in order to improve their capabilities, start using new ML/algorithmic components that don’t fit into your AI control scheme?), and there are probably other problems that we can’t foresee right now.
By “good enough to make any progress at all” I meant “towards becoming smarter while preserving their values,” I don’t really care about their resolution of other object-level philosophical questions. E.g. if they can take steps towards safe cognitive enhancement, if they can learn something about how to deliberate effectively...
It seems to me like for the people to get stuck you have to actually imagine there is some particular level they reach where they can’t find any further way to self-improve. At that point we could ask about the total amount of risk in the deliberative process itself, etc., but my basic point is that the risk is about the same in the “people in a box” scenario as in any other scenario where they can deliberate.
I think many people would be happy to gradually expand and improve quality of life in the box. You could imagine that over the long run this box is like a small city, then a small country, etc., developing along whatever trajectory the people can envision that is optimally conducive to sorting things out in a way they would endorse.
Compared to the current situation, they may have some unrealized ability to significantly improve their quality of life, but it seems at best modest—you can do most of the obvious life improvement without compromising the integrity of the reflective process. I don’t really see how other aspects of their situation are problematic.
Re security:
There is some intermediate period before you can actually run an emulation of the human, after which the measures you discuss apply just as well to the humans (which still expand the attack surface, but apparently by an extremely tiny amount since it’s not much information, it doesn’t have to interact with the world, etc.). So we are discussing the total excess risk during that period. I can agree that over an infinitely long future the kinds of measures you mention are relevant, but I don’t yet see the case for this being a significant source of losses over the intermediate period.
(Obviously I expect our actual mechanisms to work much better than this, but given that I don’t understand why you would have significant concerns about this situation, it seems like we have some more fundamental disagreements.)
I don’t think we need to give a definition. I’m arguing that we can replace “can solve philosophical problems” with “understands what it means to give the box control of resources.” (Security is one aspect of giving the box control of resources, though presumably not the hardest.)
Is your claim that this concept, of letting the box control resources, is itself so challenging that your arguments about “philosophy is hard for humans” apply nearly as well to “defining meaningful control is hard for humans”? Are you referring to some other obstruction that would require us to give a precise definition?
It seems to me like the default is a coin flip. As long as there are unpredictable investments, a risk-neutral actor is free to keep making risky bets until they’ve either lost everything or have enough resources to win a war outright. Yes, you could prevent that by law, but if we can enforce such laws we could also subvert the formation of large coalitions. Similarly, if you have secure rights to deep space then B can guarantee itself a reasonable share of future resources, but in that case we don’t even care who wins the coalitional race. So I don’t yet see a natural scenario where A and B are forced to bargain but the “default” is for B to be able to secure a proportional fraction of the future.
Yes, you could propose a bargaining solution that could allow B to secure a proportional fraction of the future, but by the same token A could simply refuse to go for it.
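The "default is a coin flip" point can be illustrated with a small simulation. This is only a sketch under assumptions not stated above: bets are exactly fair 50/50 wagers, resources come in integer units, and the actor stops at either zero or a fixed `target` (standing in for "enough resources to win outright"). The function name and parameters are hypothetical.

```python
import random

def chance_of_reaching(start: int, target: int,
                       trials: int = 20000, seed: int = 0) -> float:
    """Estimate how often repeated fair bets take `start` units up to `target`.

    Each round the actor stakes as much as is useful (never more than it
    holds, never more than it still needs) on a fair 50/50 bet, and stops
    once it holds either 0 or `target` units.
    """
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    wins = 0
    for _ in range(trials):
        wealth = start
        while 0 < wealth < target:
            stake = min(wealth, target - wealth)
            wealth += stake if rng.random() < 0.5 else -stake
        wins += wealth == target
    return wins / trials

# An actor holding half of what it needs ends up with everything about half
# the time; an actor holding a quarter, about a quarter of the time.
half = chance_of_reaching(start=1, target=2)
quarter = chance_of_reaching(start=1, target=4)
print(half, quarter)
```

Because fair bets leave expected wealth unchanged, the chance of reaching the target is roughly `start / target`. In this toy model a risk-neutral actor can convert any share into an equivalent-probability shot at the whole, which is the sense in which the coin flip is already available by default.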
It seems that you are concerned that our AI’s decisions may be bad due to a lack of certain kinds of philosophical understanding, and in particular that it may lose a bunch of value by failing to negotiate coalitions. I am pointing out that even given our current level of philosophical understanding, there is a wide range of plausible bargaining strategies, and I don’t see much of an argument yet that we would end up in a situation where we are at a significant disadvantage due to our lack of philosophical understanding. To get some leverage on that claim, I’m inclined to discuss a bunch of currently-plausible bargaining approaches and then to talk about why they may fall far short.
In the kinds of scenarios I am imagining, you would never do anything even a little bit like explicitly defining a class of bargaining solution and then accepting precisely those. Even in the “put humans in a box, acquire resources, give them meaningful control over those resources” we aren’t going to give a formal definition of “box,” “resources,” “meaningful control.” The whole point is just to lower the required ability to do philosophy to the ability required to implement that plan well enough to capture most of the value.
In order to argue against that, it seems like you want to say that in fact implementing that plan is very philosophically challenging. To that end, it’s great to say something like “existing bargaining strategies aren’t great, much better ones are probably possible, finding them probably requires great philosophical sophistication.” But I don’t think one can complain about hand-coding a mechanism for bargaining.
I understand your position on this. I agree that we can’t reliably predict all important/time-sensitive philosophical problems. I don’t yet see why this is a problem for my view. I feel like we are kind of going around in circles on this point; to me it feels like this is because I haven’t communicated my view, but it could also be that I am missing some aspect of your view.
To me, the existence of important/time-sensitive philosophical problems seems similar to the existence of destructive technologies. (I think destructive technologies are a much larger problem and I don’t find the argument for the likelihood of important/time-sensitive philosophy problems compelling. But my main point is the same in both cases and it’s not clear that their relative importance matters.)
I discuss these issues in this post. I’m curious whether you see a disanalogy between these cases, or think that this argument is not valid in the case of destructive technologies either, or think that this is the wrong framing for the current discussion / you are interested in answering a different question than I am / something along those lines.
I see how expecting destructive technologies / philosophical hurdles can increase the value you place on what I called “getting our house in order,” as well as on developing remedies for particular destructive technologies / solving particular philosophical problems / solving metaphilosophy. I don’t see how it can revise our view of the value of AI control by more than say a factor of 2.
I don’t see working on metaphilosophy/philosophy as anywhere near as promising as AI control, and again, viewed from this perspective, I don’t think you are really trying to argue for that claim (it seems like that would have to be a quantitative argument about the expected damages from lack of timely solutions to philosophical problems and about the tractability of some approach to metaphilosophy or some particular line of philosophical inquiry).
I can imagine that AI control is less promising than other work on getting our house in order. My current suspicion is that AI control is more effective, but realistically it doesn’t matter much to me because of comparative advantage considerations. If not for comparative advantage considerations I would be thinking more about the relative promise of getting our house in order, as well as other forms of capacity-building.
For philosophy, levels of ability are not comparable, because problems to be solved are not sufficiently formulated. Approximate one-day humans (as in HCH) will formulate different values from accurate ten-years humans, not just be worse at elucidating them. So perhaps you could re-implement cognition starting from approximate one-day humans, but values of the resulting process won’t be like mine.
Approximate short-lived humans may be useful for building a task AI that lets accurate long-lived humans (ems) work on the problem, but it must also allow them to make decisions, it can’t be trusted to prevent what it considers to be a mistake, and so it can’t guard the world from AI risk posed by the long-lived humans, because they are necessary for formulating values. The risk is that “getting our house in order” outlaws philosophical progress, prevents changing things based on considerations that the risk-prevention sovereign doesn’t accept. So the scope of the “house” that is being kept in order must be limited, there should be people working on alignment who are not constrained.
I agree that working on philosophy seems hopeless/inefficient at this point, but that doesn’t resolve the issue, it just makes it necessary to reduce the problem to setting up a very long term alignment research project (initially) performed by accurate long-lived humans, guarded from current risks, so that this project can do the work on philosophy. If this step is not in the design, very important things could be lost, things we currently don’t even suspect. Messy Task AI could be part of setting up the environment for making it happen (like enforcing absence of AI or nanotech outside the research project). Your writing gives me hope that this is indeed possible. Perhaps this is sufficient to survive long enough to be able to run a principled sovereign capable of enacting values eventually computed by an alignment research project (encoded in its goal), even where the research project comes up with philosophical considerations that the designers of the sovereign didn’t see (as in this comment). Perhaps this task AI can make the first long-lived accurate uploads, using approximate short-lived human predictions as initial building blocks. Even avoiding the interim sovereign altogether is potentially an option, if the task AI is good enough at protecting the alignment research project from the world, although that comes with astronomical opportunity costs.
I didn’t think that the scenario assumed the bunch of humans in a box had access to enough industrial/technology base to do cognitive enhancement. It seems like we’re in danger of getting bogged down in details about the “people in box” scenario, which I don’t think was meant to be a realistic scenario. Maybe we should just go back to talking about your actual AI control proposals?
Here’s one: Suppose A, B, C each share 1⁄3 of the universe. If A and B join forces they can destroy C and take C’s resources, otherwise it’s a stalemate. (To make the problem easier assume C can’t join with anyone else.) Another one is A and B each have secure rights, but they need to join together to maximize negentropy.
I’m not sure what analogy you’re proposing between the two cases. Can you explain more?
I didn’t understand this claim when I first read it on your blog. Can you be more formal/explicit about what two numbers you’re comparing, that yields less than a factor of 2?
I’m happy to drop it, we seem to go around in circles on this point as well, I thought this example might be easier to agree about but I no longer think that.
Certain destructive technologies will lead to bad outcomes unless we have strong coordination mechanisms (to prevent anyone from using such technologies). Certain philosophical errors might lead to bad outcomes unless we have strong coordination mechanisms (to prevent anyone from implementing philosophically unsophisticated solutions). The mechanisms that could cope with destructive technologies could also cope with philosophical problems.
You argue: there are likely to exist philosophical problems which must be solved before reaching a certain level of technological sophistication, or else there will be serious negative consequences.
I reply: your argument has at most a modest effect on the value of AI control work of the kind I advocate.
Your claim does suggest that my AI control work is less valuable. If there are hard philosophical problems (or destructive physical technologies), then we may be doomed unless we coordinate well, whether or not we solve AI control.
Here is a very crude quantitative model, to make it clear what I am talking about.
Let P1 be the probability of coordinating before the development of AI that would be catastrophic without AI control, and let P2 be the probability of coordinating before the next destructive technology / killer philosophical hurdle after that.
If there are no destructive technologies or philosophical hurdles, then the value of solving AI control is (1 - P1). If there are destructive technologies or philosophical hurdles, then the value of solving AI control is (P2 - P1). I am arguing that (P2 - P1) >= 0.5 * (1 - P1).
This comes down to the claim that P(get house in order after AI but before catastrophe | not gotten house in order prior to AI) is at least 1⁄2.
If we both believe this claim, then it seems like the disagreement between us about philosophy could at best account for a factor of 2 difference in our estimates of how valuable AI control research is (where value is measured in terms of “fraction of the universe”—if we measure value in terms of dollars, your argument could potentially decrease our value significantly, since it might suggest that other interventions could do more good and hence dollars are more valuable in terms of “fraction of the universe”).
Realistically it would account for much less though, since we can both agree that there are likely to be destructive technologies, and so all we are really doing is adjusting the timing of the next hurdle that requires coordination.
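A sketch of the algebra behind the crude model above may help pin down which two numbers are being compared. It assumes, which the comment does not state explicitly, that coordinating before AI implies still being coordinated at the next hurdle; the event names C_1 and C_2 are introduced here only for illustration, and the numbers in the last comment are made up.

```latex
% C_1: coordination is achieved before AI that would be catastrophic without AI control (assumed name)
% C_2: coordination is achieved before the next destructive technology / philosophical hurdle (assumed name)
% Assumption: C_1 \subseteq C_2, so P(C_2 \wedge \neg C_1) = P_2 - P_1. Also requires P_1 < 1.
\begin{align*}
P_1 &= P(C_1), \qquad P_2 = P(C_2) \\
V_{\text{no hurdles}} &= 1 - P_1 \\
V_{\text{hurdles}} &= P_2 - P_1 = P(C_2 \wedge \neg C_1) \\
\frac{V_{\text{hurdles}}}{V_{\text{no hurdles}}} &= \frac{P_2 - P_1}{1 - P_1} = P(C_2 \mid \neg C_1) \\
P(C_2 \mid \neg C_1) \ge \tfrac{1}{2} &\iff P_2 - P_1 \ge 0.5\,(1 - P_1)
\end{align*}
% Illustrative numbers only: P_1 = 0.2, P_2 = 0.7 gives (0.7 - 0.2)/(1 - 0.2) = 0.625 \ge 0.5.
```

Under that reading, the "factor of 2" is the upper bound on the ratio (1 - P1) / (P2 - P1) whenever the conditional probability is at least 1⁄2.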
I’m not sure it’s worth arguing about this. I think that (1) these examples do only a little to increase my expectation of losses from insufficiently-sophisticated understanding of bargaining, I’m happy to argue about it if it ends up being important, but (2) it seems like the main difference is that I am looking for arguments that particular problems are costly such that it is worthwhile to work on them, while you are looking for an argument that there won’t be any costly problems. (This is related to the discussion above.)
Unlike destructive technologies, philosophical hurdles are only a problem for aligned AIs. With destructive technologies, both aligned and unaligned AIs (at least the ones who don’t terminally value destruction) would want to coordinate to prevent them and they only have to figure out how. But with philosophical problems, unaligned AIs instead want to exploit them to gain advantages over aligned AIs. For example if aligned AIs have to spend a lot of time to think about how to merge or self-improve safely (due to deferring to slow humans), unaligned AIs won’t want to join some kind of global pact to all wait for the humans to decide, but will instead move forward amongst themselves as quickly as they can. This seems like a crucial disanalogy between destructive technologies and philosophical hurdles.
This seems really high. In your Medium article you only argued that (paraphrasing) AI could be as helpful for improving coordination as for creating destructive technology. I don’t see how you get from that to this conclusion.
Unaligned AIs don’t necessarily have efficient idealized values. Waiting for (simulated) humans to decide is analogous to computing a complicated pivotal fact about unaligned AI’s values. It’s not clear that “naturally occurring” unaligned AIs have simpler idealized/extrapolated values than aligned AIs with upload-based value definitions. Some unaligned AIs may actually be on the losing side, recall the encrypted-values AI example.
Speaking for myself, the main issue is that we have no idea how to do step 3, how to tell a pre-existing sovereign what to do. A task AI with limited scope can be replaced, but an optimizer has to be able to understand what is being asked of it, and if it wasn’t designed to be able to understand certain things, it won’t be possible to direct it correctly. If in 100 years the humans come up with new principles in how the AI should make decisions (philosophical progress), it may be impossible to express these principles as directions for an existing AI that was designed without the benefit of understanding these principles.
(Of course, the humans shouldn’t be physically there, or it will be too hard to say what it means to keep them safe, but making accurate uploads and packaging the 100 years as a pure computation solves this issue without any conceptual difficulty.)
It’s not clear to me why “limited scope” and “can be replaced” are related. An agent with broad scope can still be optimizing something like “what the human would want me to do today” and the human could have preferences like “now that humans believe that an alternative design would have been better, gracefully step aside.” (And an agent with narrow scope could be unwilling to step aside if so doing would interfere with accomplishing its narrow task.)
Being able to “gracefully step aside” (to be replaced) is an example of what I meant by “limited scope” (in time). Even if AI’s scope is “broad”, the crucial point is that it’s not literally everything (and by default it is). In practice it shouldn’t be more than a small part of the future, so that the rest can be optimized better, using new insights. (Also, to be able to ask what humans would want today, there should remain some humans who didn’t get “optimized” into something else.)
I was talking specifically about algorithms that build a model of a human and then optimize over that model in order to do useful algorithmic work (e.g. modeling human translation quality and then choosing the optimal translation).
I still don’t get your position on this point, but we seem to be going around a bit in circles. Probably the most useful thing would be responding to Jessica’s hypothetical about putting humanity in a box.
I am not just looking for an aligned AI, I am looking for a transformation from (arbitrary AI) --> (aligned AI). I think the thing-to-understand is what kind of algorithms can plausibly give you AI systems without being repurposable to being aligned (with O(1) work). I think this is the kind of problem for which you are either going to get a positive or negative answer. That is, probably there is either such a transformation, or a hard-to-align design and an argument for why we can’t align it.
(This example with optimizing over human translations seems like it could well be an insurmountable obstruction, implying that my most ambitious goal is impossible.)
I believe that both of us think that what you perceive as a problem can be sidestepped (I think this is the same issue we are going in circles around).
The hope is to do a sublinear amount of additional work, not to get it for free.
It seems like we are roughly on the same page; but I am more optimistic about either discovering a positive answer or a negative answer, and so I think this approach is the highest-leveraged thing to work on and you don’t.
I think that cognitive or institutional enhancement is also a contender, as is getting our house in order, even if our only goal is dealing with AI risk.
Yes, my comment was more targeted to other people, who I’m hoping can provide their own views on these issues. (It’s kind of strange that more people haven’t commented on your ideas online. I’ve asked to be invited to any future MIRI workshops discussing them, in case most of the discussions are happening offline.)
Can you be more explicit and formal about what you’re looking for? Is it a transformation T, such that for any AI A, T(A) is an aligned AI as efficient as A, and applying T amounts to O(1) of work? (O(1) relative to what variable? The work that originally went into building A?)
If that’s what you mean, then it seems obvious that T doesn’t exist, but I don’t know how else to interpret your statement.
I don’t understand why this disjunction is true, which might be because of my confusion above. Also, even if you found a hard-to-align design and an argument for why we can’t align it, that doesn’t show that aligned AIs can’t be competitive with unaligned AIs (in order to convince others to coordinate, as Jessica wrote). The people who need convincing will just think there are almost certainly other ways to build a competitive aligned AI that don’t involve transforming the hard-to-align design.
Consider some particular research program that might yield powerful AI systems, e.g. (search for better model classes for deep learning, search for improved optimization algorithms, deploy these algorithms on increasingly large hardware). For each such research program I would like to have some general recipe that takes as input the intermediate products of that program (i.e. the hardware and infrastructure, the model class, the optimization algorithms) and uses them to produce a benign AI which is competitive with the output of the research program. The additional effort and cost required for this recipe would ideally be sublinear in the effort invested in the underlying research project (O(1) was too strong).
I suspect this is possible for some research programs and not others. I expect there are some programs for which this goal is demonstrably hopeless. I think that those research programs need to be treated with care. Moreover, if I had a demonstration that a research program was dangerous in this way, I expect that I could convince people that it needs to be treated with care.
Yes, at best someone might agree that a particular research program is dangerous/problematic. That seems like enough though—hopefully they could either be convinced to pursue other research programs that aren’t problematic, or would continue with the problematic research program and could then agree that other measures are needed to avert the risk.
If an AI causes its human controller to converge to false philosophical conclusions (especially ones relevant to their values), either directly through its own actions or indirectly by allowing adversarial messages to reach the human from other AIs, it clearly can’t be considered benign. But given our current lack of metaphilosophical understanding, how do you hope to show that any particular AI (e.g., the output of a proposed transformation/recipe) won’t cause that? Or is the plan to accept a lower burden of proof, namely assume that the AI is benign as long as no one can show that it does cause its human controller to converge to false philosophical conclusions?
Even if such a recipe existed for a project, that wouldn’t mean the project is unproblematic. For example, suppose there are projects A and B, each of which had such a recipe. It’s still possible that using intermediate products of both A and B, one could build a more efficient unaligned AI but not a competitive aligned AI (at sublinear additional cost). Similarly, if recipes didn’t exist for projects A and B individually, one might still exist for A+B. It seems like to make a meaningful statement you have to treat the entire world as one big research program. Do you agree?
My hope is to get technique A working, then get technique B working, and then get A+B working, and so on, prioritizing based on empirical guesses about what combinations will end up being deployed in practice (and hoping to develop general understanding that can be applied across many programs and combinations). I expect that in many cases, if you can handle A and B you can handle A+B, though some interactions will certainly introduce new problems. This program doesn’t have much chance of success without new abstractions.
This is definitely benign on my accounting. There is a further question of how well you do in conflict. A window is benign but won’t protect you from inputs that will drive you crazy. The hope is that if you have an AI that is benign + powerful then you may be OK.
If the agent is trying to implement deliberation in accordance with the user’s preferences about deliberation, then I want to call that benign. There is a further question of whether we mess up deliberation, which could happen with or without AI. We would like to set things up in such a way that we aren’t forced to deliberate earlier than we would otherwise want to. (And this is included in the user’s preferences about deliberation, i.e. a benign AI will be trying to secure for the user the option of deliberating later, if the user believes that deliberating later is better than deliberating in concert with the AI now.)
Malign just means “actively optimizing for something bad,” the hope is to avoid that, but this doesn’t rule out other kinds of problems (e.g. causing deliberation to go badly due to insufficient competence, blowing up the world due to insufficient competence, etc.)
Overall, my current best guess is that this disagreement is better to pursue after my research program is further along, we know things like whether “benign” makes sense as an abstraction, I have considered some cases where benign agents necessarily seem to be less efficient, and so on.
I am still interested in arguments that might (a) convince me to not work on this program, e.g. because I should be working on alternative social solutions, or (b) convince others to work on this program, e.g. because they currently don’t see how it could succeed but might work on it if they did, or (c) which clarify the key obstructions for this research program.
I really agree with #2 (and I think with #1, as well, but I’m not as sure I understand your point there).
I’ve been trying to convince people that there will be strong trade-offs between safety and performance, and have been surprised that this doesn’t seem obvious to most… but I haven’t really considered that “efficient aligned AIs almost certainly exist as points in mindspace”. In fact I’m not sure I agree 100% (basically because “Moloch” (http://slatestarcodex.com/2014/07/30/meditations-on-moloch/)).
I think “trying to find and pursue other approaches to solving the “AI risk” problem, especially ones that don’t require the same preconditions in order to succeed” remains perhaps the most important thing to do; do you have anything in particular in mind? Personally, I tend to think that we ought to address the coordination problem head-on and attempt a solution before AGI really “takes off”.
What do you see as the best arguments for this claim? I haven’t seen much public argument for it and am definitely interested in seeing more. I definitely grant that it’s prima facie plausible (as is the alternative).
Some caveats:
It’s obvious there are trade-offs between safety and performance in the usual sense of “safety.” But we are interested in a special kind of failure, where a failed system ends up controlling a significant share of the entire universe’s resources (rather than e.g. causing an explosion), and it’s less obvious that preventing such failures necessarily requires a significant cost.
It’s also obvious that there is an additional cost to be paid in order to solve control, e.g. consider the fact that we are currently spending time on it. But the question is how much additional work needs to be done. Does building aligned systems require 1000% more work? 10%? 0.1%? I don’t see why it should be obvious that this number is on the order of 100% rather than 1%.
Similarly for performance costs. I’m willing to grant that an aligned system will be more expensive to run. But is that cost an extra 1000% or an extra 0.1%? Both seem quite plausible. From a theoretical perspective the question is whether the required overhead is linear or sublinear.
I haven’t seen strong arguments for the “linear overhead” side, and my current guess is that the answer is sublinear. But again, both positions seem quite plausible.
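To make the linear-versus-sublinear distinction concrete, here is a minimal sketch. The two cost curves and their constants are hypothetical, chosen only for illustration and not taken from the discussion. It shows that a linear extra cost is a fixed percentage at every scale, while a sublinear one becomes a vanishing percentage as the underlying effort grows.

```python
import math
from typing import Callable


def relative_overhead(base_effort: float, extra_cost: Callable[[float], float]) -> float:
    """Extra cost of alignment expressed as a fraction of the base effort."""
    return extra_cost(base_effort) / base_effort


# Hypothetical cost curves (assumptions for illustration only).
def linear_cost(n: float) -> float:
    # Extra cost proportional to base effort: a constant 50% overhead.
    return 0.5 * n


def sublinear_cost(n: float) -> float:
    # Extra cost growing like sqrt(n): the overhead fraction shrinks as n grows.
    return 10.0 * math.sqrt(n)


if __name__ == "__main__":
    for n in (1e2, 1e4, 1e6):
        print(n, relative_overhead(n, linear_cost), relative_overhead(n, sublinear_cost))
```

In this toy setup the question "1000% or 0.1%?" depends on scale only in the sublinear case; in the linear case the percentage is the same no matter how much effort goes into the underlying project.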
(There are currently a few major obstructions to my approach that could plausibly give a tight theoretical argument for linear overhead, such as the translation example in the discussion with Wei Dai. In the past such obstructions have ended up seeming surmountable, but I think that it is totally plausible that eventually one won’t. And at that point I hope to be able to make clean statements about exactly what kind of thing we can’t hope to do efficiently+safely / exactly what kinds of additional assumptions we would have to make / what the key obstructions are).
I think this is a good idea and a good project, which I would really like to see more people working on. In the past I may have seemed more dismissive and if so I apologize for being misguided. I’ve spent a little bit of time thinking about it recently and my feeling is that there is a lot of productive and promising work to do.
My current guess is that AI control is the more valuable thing for me personally to do though I could imagine being convinced out of this.
I feel that AI control is valuable given that (a) it has a reasonable chance of succeeding even if we can’t solve these coordination problems, and (b) convincing evidence that the problem is hard would be a useful input into getting the AI community to coordinate.
If you managed to get AI researchers to effectively coordinate around conditionally restricting access to AI (if it proved to be dangerous), then that would seriously undermine argument (b). I believe that a sufficiently persuasive/charismatic/accomplished person could probably do this today.
If I ended up becoming convinced that AI control was impossible, this would undermine argument (a) (though hopefully that impossibility argument could itself be used to satisfy desideratum (b)).
My model of Nate thinks the path to victory goes through the aligned AI project gaining a substantial first mover advantage (through fast local takeoff, more principled algorithms, and/or better coordination). Though he’s also quite concerned about extremely large efficiency disadvantages of aligned AI vs unaligned AI (e.g. he’s pessimistic about act-based agents helping much because they might require the AI to be good at predicting humans doing complex tasks such as research).
In this case I expect that in <10 years we get something like: “we tried making aligned versions of a bunch of algorithms, but the aligned versions are always less powerful because they left out some source of power the unaligned versions had. We iterated the process a few times (studying the additional sources of power and making aligned versions of them), and this continued to be the case. We have good reasons to believe that there isn’t a sensible stopping point to this process.” This seems pretty close to a fundamental obstruction and it seems like it would be similarly useful, especially if the “good reasons to believe there isn’t a sensible stopping point to this process” tell us something new about which relaxations are promising.
I don’t see this as being the case. As Vadim pointed out, we don’t even know what we mean by “aligned versions” of algos, ATM. So we wouldn’t know if we’re succeeding or failing (until it’s too late and we have a treacherous turn).
It looks to me like Wei Dai shares my views on “safety-performance trade-offs” (grep it here: http://graphitepublications.com/the-beginning-of-the-end-or-the-end-of-beginning-what-happens-when-ai-takes-over/).
I’d paraphrase what he’s said as:
“Orthogonality implies that alignment shouldn’t cost performance, but says nothing about the costs of ‘value loading’ (i.e. teaching an AI human values and verifying its value learning procedure and/or the values it has learned). Furthermore, value loading will probably be costly, because we don’t know how to do it, competitive dynamics make the opportunity cost of working on it large, and we don’t even have clear criteria for success.”
Which I emphatically agree with.
Even beyond Jessica’s point (that failure to improve our understanding would constitute an observable failure), I don’t completely buy this.
We are talking about AI safety because there are reasons to think that AI systems will cause a historically unprecedented kind of problem. If we could design systems for which we had no reason to expect them to cause such problems, then we can rest easy.
I don’t think there is some kind of magical and unassailable reason to be suspicious of powerful AI systems, there are just a bunch of particular reasons to be concerned.
Similarly, there is no magical reason to expect a treacherous turn—this is one of the kinds of unusual failures which we have reason to be concerned about. If we built a system for which we had no reason to be concerned, then we shouldn’t be concerned.
I think the core of our differences is that I see minimally constrained, opaque, utility-maximizing agents with good models of the world and access to rich interfaces (sensors and actuators) as extremely likely to be substantially more powerful than what we will be able to build if we start degrading any of these properties.
These properties also seem sufficient for a treacherous turn (in an unaligned AI).
The only point on which there is plausible disagreement is “utility-maximizing agents.” On a narrow reading of “utility-maximizing agents” it is not clear why it would be important to getting more powerful performance.
On a broad reading of “utility-maximizing agents” I agree that powerful systems are utility-maximizing. But if we take a broad reading of this property, I don’t agree with the claim that we will be unable to reliably tell that such agents aren’t dangerous without theoretical progress.
In particular, there is an argument of the form “the prospect of a treacherous turn makes any informal analysis unreliable.” I agree that the prospect of a treacherous turn makes some kinds of informal analysis unreliable. But I think it is completely wrong that it makes all informal analysis unreliable, I think that appropriate informal analysis can be sufficient to rule out the prospect of a treacherous turn. (Most likely an analysis that keeps track of what is being optimized, and rules out the prospect that an indicator was competently optimized to manipulate our understanding of the current situation.)
Paul, I’m not sure I understand what you’re saying here. Can you imagine a system “for which we had no reason to expect it to cause such problems” without an underlying mathematical theory that shows why this system is safe?
The reason AI systems will cause a historically unprecedented kind of problem is that AI systems can outsmart humans and thus create situations that are outside our control, even when we don’t a priori see the precise mechanism by which we will lose control. In order for such a system to be safe, we need to know that it will not attempt anything detrimental to us, and we need to know this as an abstraction, i.e. without knowing in detail what the system will do (because the system is superintelligent, so by definition we cannot guess its actions).
Doesn’t it seem improbable to you that we will have a way of having such knowledge by some other means than the accuracy of mathematical thought?
That is, we can have a situation like “AI running in homomorphic encryption with a quantum-generated key that is somewhere far from the AI’s computer” where it’s reasonable to claim that the AI is safe as long as it stays encrypted (even though there is still some risk from being wrong about cryptographic conjectures or the AI exploiting some surprising sort of unknown physics), without using a theory of intelligence at all (beyond the fact that intelligence is a special case of computation), but it seems unlikely that we can have something like this while simultaneously having the AI powerful enough to protect us against other AIs that are malicious.
Yes. For example, suppose we built a system whose behavior was only expected to be intelligent to the extent that it imitated intelligent human behavior—for which there is no other reason to believe that it is intelligent. Depending on the human being imitated, such a system could end up seeming unproblematic even without any new theoretical understanding.
We don’t yet see any way to build such a system, much less to do so in a way that could be competitive with the best RL system that could be designed at a given level of technology. But I can certainly imagine it.
(Obviously I think there is a much larger class of systems that might be non-problematic, though it may depend on what we mean by “underlying mathematical theory.”)
This doesn’t seem sufficient for trouble. Trouble only occurs when those systems are effectively optimizing for some inhuman goals, including e.g. acquiring and protecting resources.
That is a very special thing for a system to do, above and beyond being able to accomplish tasks that apparently require intelligence. Currently we don’t have any way to accomplish the goals of AI that don’t risk this failure mode, but it’s not obvious that it is necessary.
This doesn’t seem to be a valid example: your system is not superintelligent, it is “merely” human. That is, I can imagine solving AI risk by building whole brain emulations with enormous speed-up and using them to acquire absolute power. However:
I think this is not what is usually meant by “solving AI alignment.”
The more you use heuristic learning algorithms instead of “classical” brain emulation the more I would be worried your algorithm does something subtly wrong in a way that distorts values, although that would also invalidate the condition that “there is no other reason to believe that it is intelligent.”
There is a high-risk zone here where someone untrustworthy can gain this technology and use it to unwittingly create unfriendly AI.
Well, any AI is effectively optimizing for some goal by definition. How do you know this goal is “human”? In particular, if your AI is supposed to defend us from other AIs, it is very much in the business of acquiring and protecting resources.
If we fail to make the intuition about aligned versions of algorithms more crisp than it currently is, then it’ll be pretty clear that we failed. It seems reasonable to be skeptical that we can make our intuitions about “aligned versions of algorithms” crisp and then go on to design competitive and provably aligned versions of all AI algorithms in common use. But it does seem like we will know if we succeed at this task, and even before then we’ll have indications of progress such as success/failure at formalizing and solving scalable AI control in successively complex toy environments. (It seems like I have intuitions about what would constitute progress that are hard to convey over text, so I would not be surprised if you aren’t convinced that it’s possible to measure progress).
It seems like “value loading is very hard/costly” has to imply that the proposal in this comment thread is going to be very hard/costly, e.g. because one of Wei Dai’s objections to it proves fatal. But it seems like arguments of the form “human values are complex and hard to formalize” or “humans don’t know what we value” are insufficient to establish this; Wei Dai’s objections in the thread are mostly not about value learning. (sorry if you aren’t arguing “value loading is hard because human values are complex and hard to formalize” and I’m misinterpreting you)