Thanks for the comments, these are excellent!
Valid complaint on the title, I basically agree. I only give the path outlined in the OP a ~10% chance of working without any further intervention by AI safety people, and I definitely agree that there are relatively-tractable-seeming ways to push that number up on the margin. (Though those would be marginal improvements only; I don’t expect them to close the bulk of the gap without at least some progress on theoretical bottlenecks.)
I am generally lukewarm about human-simulation approaches to alignment; the fusion power generator scenario is a prototypical example of my concerns here (also see this comment on it, which explains what I see as the key take-away). The idea of simulating a human doing moral philosophy is a bit different than what I usually imagine, though; it’s basically like taking an alignment researcher and running them on faster hardware. That doesn’t directly solve any of the underlying conceptual problems—it just punts them to the simulated researchers—but it is presumably a strict improvement over a limited number of researchers operating slowly in meatspace. Alignment research ems!
Suppose we restrict ourselves to feeding our system data from before the year 2000. There should be a decent representation of human values to be learned from this data, yet it should be quite difficult to figure out the specifics of the 2020+ data-collection process from it.
I don’t think this helps much. Two examples of “specifics of the data collection process” to illustrate:
Suppose our data consists of human philosophers’ writing on morality. Then the “specifics of the data collection process” includes the humans’ writing skills and signalling incentives, and everything else besides the underlying human values.
Suppose our data consists of humans’ choices in various situations. Then the “specifics of the data collection process” includes the humans’ mistaken reasoning, habits, divergence of decision-making from values, and everything else besides the underlying human values.
So “specifics of the data collection process” is a very broad notion in this context. Essentially all practical data sources will include a ton of extra information besides just their information on human values.
Second, one way to check on things is to deliberately include a small quantity of mislabeled data, then once the system is done learning, check whether its model correctly recognizes that the mislabeled data is mislabeled (and agrees with all data that is correctly labeled).
I like this idea, and I especially like it in conjunction with deliberate noise as an unsupervised learning trick. I’ll respond more to that on the other comment.
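To make the check concrete, here is roughly the shape of the test I have in mind (a sketch only; model.predict_label, clean_data, and planted_errors are hypothetical stand-ins, not any particular API):

```python
# Sketch of the planted-error check described above. The names are
# hypothetical stand-ins; the point is just the shape of the test.

def passes_canary_check(model, clean_data, planted_errors):
    """True iff the trained model rejects the deliberately mislabeled
    examples and agrees with the correctly labeled ones."""
    catches_errors = all(
        model.predict_label(x) != wrong_label
        for x, wrong_label in planted_errors
    )
    agrees_with_clean = all(
        model.predict_label(x) == label
        for x, label in clean_data
    )
    return catches_errors and agrees_with_clean
```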
A third way which you don’t mention is to use the initial aligned AI as a “human values oracle” for subsequent AIs.
I have mixed feelings on this.
My main reservation is that later AIs will never be more precisely aligned than the oracle. That first AI may be basically-correctly aligned, but it still only has so much data and probably only rough algorithms, so I’d really like it to be able to refine its notion of human values over time. In other words, the oracle’s notion of human values may be accurate but not precise, and I’d like precision to improve as more data comes in and better algorithms are found. This is especially important if capabilities rise over time and greater capabilities require more precise alignment.
That said, as long as the oracle’s alignment is accurate, we could use your suggestion to make sure that actions are OK for all possible human-values-notions within uncertainty. That’s probably at least good enough to avoid disaster. It would still fall short of the full potential value of AI—there’d be missed opportunities, where the system has to be overly careful because its notion of human values is insufficiently precise—but at least no disaster.
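To spell out the decision rule I have in mind (just a sketch; value_models is a hypothetical list of candidate human-values models still within the oracle's uncertainty, with acceptable and utility as made-up method names):

```python
# Sketch of the "OK under every values-hypothesis" rule. Every name here is
# a hypothetical stand-in for whatever the values oracle would expose.

def action_is_ok(action, value_models):
    # Conservative: an action passes only if every hypothesis signs off on it.
    return all(vm.acceptable(action) for vm in value_models)

def pick_action(candidates, value_models):
    safe = [a for a in candidates if action_is_ok(a, value_models)]
    if not safe:
        return None  # nothing clears the bar -> do nothing (a missed opportunity)
    # Among safe actions, maximize the worst case across the hypotheses.
    return max(safe, key=lambda a: min(vm.utility(a) for vm in value_models))
```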
Finally, on deceptive behavior: I use the phrase a bit differently than I think most people do these days. My prototypical image isn’t of a mesa-optimizer. Rather, I imagine people iteratively developing a system, trying things out, keeping things which seem to work, and thereby selecting for things which look good to humans (regardless of whether they’re actually good). In that situation, we’d expect the system to end up doing things which look good but aren’t, because the human developers accidentally selected for that sort of behavior. It’s a “you-get-what-you-measure” problem, rather than a mesa-optimizers problem.
I don’t expect them to close the bulk of the gap without at least some progress on theoretical bottlenecks.
Can you be more specific about the theoretical bottlenecks that seem most important?
I am generally lukewarm about human-simulation approaches to alignment; the fusion power generator scenario is a prototypical example of my concerns here (also see this comment on it, which explains what I see as the key take-away).
I agree that Tool AI is not inherently safe. The key question is which problem is easier: the alignment problem, or the safe-use-of-dangerous-tools problem. All else equal, if you think the alignment problem is hard, then you should be more willing to replace alignment work with tool safety work. If you think the alignment problem is easy, you should discourage dangerous tools in favor of frontloaded work on a more paternalistic “not just benign, actually aligned” AI.
An analogy here would be Linux vs Windows. Linux lets you shoot your foot off and wipe your hard drive with a single command, but it also gives you greater control of your system and your computer is less likely to get viruses. Windows is safer and more paternalistic, with less user control. Windows is a better choice for the average user, but that’s partially because we have a lot of experience building operating systems. It wouldn’t make sense to aim for a Windows-style OS as our first operating system, because (a) it’s a more ambitious project and (b) we wouldn’t have enough experience to know the right ways in which to be paternalistic. Heck, it was you who linked disparagingly to waterfall-style software development the other day :) There’s a lot to be said for simplicity of implementation.
(Random aside: In some sense I think the argument for paternalism is self-refuting, because the argument is essentially that humans can’t be trusted, but I’m not sure the total amount of responsibility we’re assigning to humans has changed—if the first system is to be very paternalistic, that puts an even greater weight of responsibility on the shoulders of its designers to be sure and get it right. I’d rather shove responsibility into the post-singularity world, because the current world seems non-ideal: for example, AI designers have limited time to think due to possible arms races.)
What do I mean by the “safe-use-of-dangerous-tools problem”? Well, many dangerous tools will come with an instruction manual or mandatory training in safe tool use. For a tool AI, this manual might include things like:
Before asking the AI any question, ask: “If I ask Question X, what is the estimated % chance that I will regret asking on reflection?”
Tell the AI: “When you answer this question, instead of revealing any information you think will plausibly harm me, replace it with [I’m not revealing this because it could plausibly harm you]”
If using a human-simulation approach to alignment, tell your AI to only make use of the human-simulation to inform terminal values, never instrumental values. Or give the human simulation loads of time to reflect, so it’s effectively a speed superintelligence (assuming for the moment what seems to be a common AI safety assumption that more reflection always improves outcomes—skepticism here). Or make sure the simulated human has access to the safety manual.
I think it’s possible to do useful work on the manual for the Tool AI even in the absence of any actual Tool AI having been created. In fact, I suspect this work will generalize better between different AI designs than most alignment work generalizes between designs.
Insights from our manual could even be incorporated into the user interface for the tool. For example, the question-asking flow could by default show us the answer to the question “If I ask Question X, what is the estimated % chance that I will regret asking on reflection?” and ask us to read the result and confirm that the question is actually one we want to ask. This would be analogous to alias rm='rm -i' in Linux—it doesn’t reduce transparency or add brittle complexity, but it does reduce the risk of shooting ourselves in the foot.
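Concretely, that flow might look something like this (just a sketch; tool.estimate_regret_probability and tool.answer are hypothetical names for whatever interface the tool AI actually exposes):

```python
# Sketch of the confirm-before-asking flow, analogous to alias rm='rm -i'.
# The method names on `tool` are hypothetical, not a real API.

def ask_with_confirmation(tool, question):
    p_regret = tool.estimate_regret_probability(question)
    print(f"Estimated chance you'll regret asking, on reflection: {p_regret:.0%}")
    if input("Ask anyway? [y/N] ").strip().lower() != "y":
        return None
    return tool.answer(question)
```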
BTW you wrote:
Coming at it from a different angle: if a safety problem is handled by a system’s designer, then their die-roll happens once up-front. If that die-roll comes out favorably, then the system is safe (at least with respect to the problem under consideration); it avoids the problem by design. On the other hand, if a safety problem is left to the system’s users, then a die-roll happens every time the system is used, so inevitably some of those die rolls will come out unfavorably. Thus the importance of designing AI for safety up-front, rather than relying on users to use it safely.
One possible plan for the tool is to immediately use it to create a more paternalistic system (or just generate a bunch of UI safeguards as I described above). So then you’re essentially just rolling the dice once.
Two examples of “specifics of the data collection process” to illustrate
From my perspective, these examples essentially illustrate that there’s not a single natural abstraction for “human values”—but as I said elsewhere, I think that’s a solvable problem.
My main reservation is that later AIs will never be more precisely aligned than the oracle. That first AI may be basically-correctly aligned, but it still only has so much data and probably only rough algorithms, so I’d really like it to be able to refine its notion of human values over time. In other words, the oracle’s notion of human values may be accurate but not precise, and I’d like precision to improve as more data comes in and better algorithms are found. This is especially important if capabilities rise over time and greater capabilities require more precise alignment.
Let’s make the later AIs corrigible then. Perhaps our initial AI can give us both a corrigibility oracle and a values oracle. (Or later AIs could use some other approach to corrigibility.)
Can you be more specific about the theoretical bottlenecks that seem most important?
Type signature of human values is the big one. I think it’s pretty clear at this point that utility functions aren’t the right thing, that we value things “out in the world” as opposed to just our own “inputs” or internal state, that values are not reducible to decisions or behavior, etc. We don’t have a framework for what-sort-of-thing human values are. If we had that—not necessarily a full model of human values, just a formalization which we were confident could represent them—then that would immediately open the gates to analysis, to value learning, to uncertainty over values, etc.
The key question is which problem is easier: the alignment problem, or the safe-use-of-dangerous-tools problem. All else equal, if you think the alignment problem is hard, then you should be more willing to replace alignment work with tool safety work. If you think the alignment problem is easy, you should discourage dangerous tools in favor of frontloaded work on a more paternalistic “not just benign, actually aligned” AI.
A good argument, but I see the difficulties of safe tool AI and the difficulties of alignment as mostly coming from the same subproblem. To the extent that that’s true, alignment work and tool safety work need to be basically the same thing.
On the tools side, I assume the tools will be reasoning about systems/problems which humans can’t understand—that’s the main value prop in the first place. Trying to collapse the complexity of those systems into a human-understandable API is inherently dangerous: values are complex, the system is complex, their interaction will inevitably be complex, so any API simple enough for humans will inevitably miss things. So the only safe option which can scale to complex systems is to make sure the “tools” have their own models of human values, and use those models to check the safety of their outputs… which brings us right back to alignment.
Simple mechanisms like always displaying an estimated probability that I’ll regret asking a question would probably help, but I’m mainly worried about the unknown unknowns, not the known unknowns. That’s part of what I mean when I talk about marginal improvements vs closing the bulk of the gap—the unknown unknowns are the bulk of the gap.
(I could see tools helping in a do-the-same-things-but-faster sort of way, and human-mimicking approaches in particular are potentially helpful there. On the other hand, if we’re doing the same things but faster, it’s not clear that that scenario really favors alignment research over the Leeroy Jenkins of the world.)
In some sense I think the argument for paternalism is self-refuting, because the argument is essentially that humans can’t be trusted, but I’m not sure the total amount of responsibility we’re assigning to humans has changed—if the first system is to be very paternalistic, that puts an even greater weight of responsibility on the shoulders of its designers to be sure and get it right.
This in particular I think is a strong argument, and the die-rolls argument is my main counterargument.
We can indeed partially avoid the die-rolls issue by only using the system a limited number of times—e.g. to design another system. That said, in order for the first system to actually add value here, it has to do some reasoning which is too complex for humans—which brings back the problem from earlier, about the inherent danger of collapsing complex values and systems into a simple API. We’d be rolling the dice twice—once in designing the first system, once in using the first system to design the second—and that second die-roll in particular has a lot of unknown unknowns packed into it.
Let’s make the later AIs corrigible then. Perhaps our initial AI can give us both a corrigibility oracle and a values oracle. (Or later AIs could use some other approach to corrigibility.)
I have yet to see a convincing argument that corrigibility is any easier than alignment itself. It seems to suffer from the same basic problem: the concept of “corrigibility” has a lot of hidden complexity, especially when it interacts with embeddedness. To the extent that we’re relying on corrigibility, I’d ideally like it to improve with capabilities, in the same way and for the same reasons as I’d like alignment to improve with capabilities. Do you know of an argument that it’s easier?
If we had that—not necessarily a full model of human values, just a formalization which we were confident could represent them—then that would immediately open the gates to analysis, to value learning, to uncertainty over values, etc.
Do you have in mind a specific aspect of human values that couldn’t be represented using, say, the reward function of a reinforcement learning agent AI?
On the tools side, I assume the tools will be reasoning about systems/problems which humans can’t understand—that’s the main value prop in the first place. Trying to collapse the complexity of those systems into a human-understandable API is inherently dangerous: values are complex, the system is complex, their interaction will inevitably be complex, so any API simple enough for humans will inevitably miss things. So the only safe option which can scale to complex systems is to make sure the “tools” have their own models of human values, and use those models to check the safety of their outputs… which brings us right back to alignment.
There’s an aspect of defense-in-depth here. If your tool’s model of human values is slightly imperfect, that doesn’t necessarily fail hard the way an agent with a model of human values that’s slightly imperfect does.
BTW, let’s talk about the “Research Assistant” story here. See more discussion here. (The problems brought up in that thread seem pretty solvable to me.)
Simple mechanisms like always displaying an estimated probability that I’ll regret asking a question would probably help, but I’m mainly worried about the unknown unknowns, not the known unknowns. That’s part of what I mean when I talk about marginal improvements vs closing the bulk of the gap—the unknown unknowns are the bulk of the gap.
That’s why you need a tool… so it can tell you the unknown unknowns you’re missing, and how to solve them. We’d rather have a single die roll, on creating a good tool, than have a separate die roll for every one of those unknown unknowns, wouldn’t we? ;-) Shouldn’t we aim for a fairly minimalist, non-paternalistic tool where unknown unknowns are relatively unlikely to become load-bearing? All we need to do is figure out the unknown unknowns that are load-bearing in the Research Assistant scenario, and then the assistant can help us with the rest of the unknown unknowns.
it has to do some reasoning which is too complex for humans—which brings back the problem from earlier, about the inherent danger of collapsing complex values and systems into a simple API.
If solving FAI necessarily involves reasoning about things which are beyond humans (which seems to be what you’re getting at with the “unknown unknowns” stuff), what is the alternative?
I have yet to see a convincing argument that corrigibility is any easier than alignment itself. It seems to suffer from the same basic problem: the concept of “corrigibility” has a lot of hidden complexity, especially when it interacts with embeddedness. To the extent that we’re relying on corrigibility, I’d ideally like it to improve with capabilities, in the same way and for the same reasons as I’d like alignment to improve with capabilities. Do you know of an argument that it’s easier?
We were discussing a scenario where we had an OK solution to alignment, and you were saying that you didn’t want to get locked into a merely OK solution for all of eternity. I’m saying corrigibility can address that. Alignment is already solvable to an OK degree in this hypothetical, so I’m assuming corrigibility is solvable to an OK degree as well.
Corrigible AI should be able to improve its corrigibility with increased capabilities the same way it can improve its alignment with increased capabilities. You say “corrigibility” has a lot of hidden complexity. The more capable the system, the more hypotheses it can generate regarding complex phenomena, and the more likely those hypotheses are to be correct. There’s no reason we can’t make the system’s notion of corrigibility corrigible in the same way its values are corrigible. (BTW, I don’t think corrigibility even necessarily needs to be thought of as separate from alignment, you can think of them as both being reflected in an agent’s reward function say. But that’s a tangent.) And we can leverage capability increases by having the system explain various notions of corrigibility it’s discovered and how they differ so we can figure out which notion(s) we want to use.
Do you have in mind a specific aspect of human values that couldn’t be represented using, say, the reward function of a reinforcement learning agent AI?
It’s not the function-representation that’s the problem, it’s the type-signature of the function. I don’t know what such a function would take in or what it would return. Even RL requires that we specify the input-output channels up-front.
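To illustrate the contrast (purely a sketch of the type-signature point, not a proposal):

```python
from typing import Callable

class State: ...    # placeholder: whatever the environment exposes
class Action: ...   # placeholder

# In RL we can write the reward function's type up-front, because the
# input/output channels are fixed when we set up the environment:
RewardFn = Callable[[State, Action], float]

# The analogous line for human values is exactly what we can't write yet.
# Do they take world-states? whole world-histories? descriptions under some
# ontology? latent variables in a learned world-model? What do they return?
# HumanValuesFn = Callable[[???], ???]   # <- the missing type signature
```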
All we need to do is figure out the unknown unknowns that are load-bearing in the Research Assistant scenario, and then the assistant can help us with the rest of the unknown unknowns.
This translates in my head to “all we need to do is solve the main problems of alignment, and then we’ll have an assistant which can help us clean up any easy loose ends”.
More generally: I’m certainly open to the idea of AI, of one sort or another, helping to work out at least some of the problems of alignment. (Indeed, that’s very likely a component of any trajectory where alignment improves over time.) But I have yet to hear a convincing case that punting now actually makes long-run alignment more likely, or even that future tools will make creation of aligned AI easier/more likely relative to unaligned AI. What exactly is the claim here?
If solving FAI necessarily involves reasoning about things which are beyond humans (which seems to be what you’re getting at with the “unknown unknowns” stuff), what is the alternative?
I don’t think solving FAI involves reasoning about things beyond humans. I think the AIs themselves will need to reason about things beyond humans, and in particular will need to reason about complex safety problems on a day-to-day basis, but I don’t think that designing a friendly AI is too complex for humans.
Much of the point of AI is that we can design systems which can reason about things too complex for ourselves. Similarly, I expect we can design safe systems which can reason about safety problems too complex for ourselves.
Corrigible AI should be able to improve its corrigibility with increased capabilities the same way it can improve its alignment with increased capabilities.
What notion of “corrigible” are you using here? It sounds like it’s not MIRI’s “the AI won’t disable its own off-switch” notion.
This translates in my head to “all we need to do is solve the main problems of alignment, and then we’ll have an assistant which can help us clean up any easy loose ends”.
To clarify: do you think the problems brought up in these answers are the main problems of alignment? This claim seems a bit odd to me because I don’t think those problems are highlighted in any of the major AI alignment research agenda papers. (Alternatively, if you feel like there are important omissions from those answers, I strongly encourage you to write your own answer!)
I did a presentation at the recent AI Safety Discussion Day on how to solve the problems in that thread. My proposed solutions don’t look much like anything that’s e.g. on Arbital because the problems are different. I can share the slides if you want, PM me your gmail address.
More generally: I’m certainly open to the idea of AI, of one sort or another, helping to work out at least some of the problems of alignment. (Indeed, that’s very likely a component of any trajectory where alignment improves over time.) But I have yet to hear a convincing case that punting now actually makes long-run alignment more likely, or even that future tools will make creation of aligned AI easier/more likely relative to unaligned AI. What exactly is the claim here?
Here’s an example of a tool that I would find helpful right now, that seems possible to make with current technology (and will get better as technology advances), and seems very low risk: Given a textual description of some FAI proposal (or proposal for solving some open problem within AI safety), highlight the contiguous passage of text within the voluminous archives of AF/LW/etc. that is most likely to represent a valid objection to this proposal. (EDIT: Or, given some AI safety problem, highlight the contiguous passage of text which is most likely to represent a solution.)
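Here's a rough sketch of the sort of thing I mean; embed stands in for any off-the-shelf text-embedding model, and plain similarity to the proposal is obviously only a crude proxy for "most likely to represent a valid objection":

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder for any off-the-shelf text-embedding model."""
    raise NotImplementedError

def most_relevant_passage(proposal: str, archive_passages: list[str]) -> str:
    """Return the archived passage (e.g. from AF/LW) most similar to the
    proposal -- a crude stand-in for 'most likely to be a valid objection'."""
    query = embed(proposal)
    best_score, best_passage = -1.0, ""
    for passage in archive_passages:
        vec = embed(passage)
        score = float(np.dot(query, vec)
                      / (np.linalg.norm(query) * np.linalg.norm(vec) + 1e-9))
        if score > best_score:
            best_score, best_passage = score, passage
    return best_passage
```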
Can you come up with improbable scenarios in which this sort of thing ends up being net harmful? Sure. But security is not binary. Just because there is some hypothetical path to harm doesn’t mean harm is likely.
Could this kind of approach be useful for unaligned AI as well? Sure. So begin work on it ASAP, keep it low profile, and restrict its use to alignment researchers in order to create maximum differential progress towards aligned AI.
Similarly, I expect we can design safe systems which can reason about safety problems too complex for ourselves.
I’m a bit confused why you’re bringing up “safety problems too complex for ourselves” because it sounds like you don’t think there are any important safety problems like that, based on the sentences that came before this one?
What notion of “corrigible” are you using here? It sounds like it’s not MIRI’s “the AI won’t disable its own off-switch” notion.
I’m talking about the broad sense of “corrigible” described in e.g. the beginning of this post.
(BTW, I just want to clarify that we’re having two parallel discussions here: One discussion is about what we should be doing very early in our AI safety gameplan, e.g. creating the assistant I described that seems like it would be useful right now. Another discussion is about how to prevent a failure mode that could come about very late in our AI safety gameplan, where we have a sorta-aligned AI and we don’t want to lock ourselves into an only sorta-optimal universe for all eternity. I expect you realize this, I’m just stating it explicitly in order to make the discussion a bit easier to follow.)
To clarify: do you think the problems brought up in these answers are the main problems of alignment?
Mostly no. I’ve been trying to write a bit more about this topic lately; Alignment as Translation is the main source of my intuitions on core problems, and the fusion power generator scenario is an example of what that looks like in a GPT-like context (parts of your answer here are similar to that).
Using GPT-like systems to simulate alignment researchers’ writing is a probably-safer use-case, but it still runs into the core catch-22. Either:
It writes something we’d currently write, which means no major progress (since we don’t currently have solutions to the major problems and therefore can’t write down such solutions), or
It writes something we currently wouldn’t write, in which case it’s out-of-distribution and we have to worry about how it’s extrapolating us.
I generally expect the former to mostly occur by default; the latter would require some clever prompts.
I could imagine at least some extrapolation of progress being useful, but it still seems like the best way to make human-simulators more useful is to improve our own understanding, so that we’re more useful to simulate.
Given a textual description of some FAI proposal (or proposal for solving some open problem within AI safety), highlight the contiguous passage of text within the voluminous archives of AF/LW/etc. that is most likely to represent a valid objection to this proposal.
This sounds like a great tool to have. It’s exactly the sort of thing which is probably marginally useful. It’s unlikely to help much on the big core problems; it wouldn’t be much use for identifying unknown unknowns which nobody has written about before. But it would very likely help disseminate ideas, and be net-positive in terms of impact.
I do think a lot of the things you’re suggesting would be valuable and worth doing, on the margin. They’re probably not sufficient to close the bulk of the safety gap without theoretical progress on the core problems, but they’re still useful.
I’m a bit confused why you’re bringing up “safety problems too complex for ourselves” because it sounds like you don’t think there are any important safety problems like that, based on the sentences that came before this one?
The “safety problems too complex for ourselves” are things like the fusion power generator scenario—i.e. safety problems in specific situations or specific applications. The safety problems which I don’t think are too complex are the general versions, i.e. how to build a generally-aligned AI.
An analogy: finding shortest paths in a billion-vertex graph is far too complex for me. But writing a general-purpose path-finding algorithm to handle that problem is tractable. In the same way, identifying the novel safety problems of some new technology will sometimes be too complex for humans. But writing a general-purpose safety-reasoning algorithm (i.e. an aligned AI) is tractable, I expect.
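The analogy in code: a few lines I can write and check, even though I could never trace them by hand on a billion-vertex graph (standard Dijkstra, included just to illustrate the general-vs-specific point):

```python
import heapq

def shortest_paths(graph, source):
    """Dijkstra's algorithm. `graph` maps each vertex to a list of
    (neighbor, edge_weight) pairs. Small and checkable, even though tracing
    it by hand on a billion-vertex graph is hopeless."""
    dist = {source: 0}
    frontier = [(0, source)]
    while frontier:
        d, u = heapq.heappop(frontier)
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry
        for v, w in graph.get(u, []):
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(frontier, (d + w, v))
    return dist
```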
I’m talking about the broad sense of “corrigible” described in e.g. the beginning of this post.
Ah ok, the suggestion makes sense now. That’s a good idea. It’s still punting a lot of problems until later, and humans would still be largely responsible for solving those problems later. But it could plausibly help with the core problems, without any obvious trade-off (assuming that the AI/oracle actually does end up pointed at corrigibility).
Mostly no. I’ve been trying to write a bit more about this topic lately; Alignment as Translation is the main source of my intuitions on core problems, and the fusion power generator scenario is an example of what that looks like in a GPT-like context (parts of your answer here are similar to that).
Well, I encourage you to come up with a specific way in which GPT-N will harm us by trying to write an AF post due to not having solved Alignment as Translation and add it as an answer in that thread. Given that we may be in an AI overhang, I’d like the answers to represent as broad a distribution of plausible harms as possible, because that thread might end up becoming very important & relevant very soon.