Can you be more specific about the theoretical bottlenecks that seem most important?
Type signature of human values is the big one. I think it’s pretty clear at this point that utility functions aren’t the right thing, that we value things “out in the world” as opposed to just our own “inputs” or internal state, that values are not reducible to decisions or behavior, etc. We don’t have a framework for what-sort-of-thing human values are. If we had that—not necessarily a full model of human values, just a formalization which we were confident could represent them—then that would immediately open the gates to analysis, to value learning, to uncertainty over values, etc.
The key question is which problem is easier: the alignment problem, or the safe-use-of-dangerous-tools problem. All else equal, if you think the alignment problem is hard, then you should be more willing to replace alignment work with tool safety work. If you think the alignment problem is easy, you should discourage dangerous tools in favor of frontloaded work on a more paternalistic “not just benign, actually aligned” AI.
A good argument, but I see the difficulties of safe tool AI and the difficulties of alignment as mostly coming from the same subproblem. To the extent that that’s true, alignment work and tool safety work need to be basically the same thing.
On the tools side, I assume the tools will be reasoning about systems/problems which humans can’t understand—that’s the main value prop in the first place. Trying to collapse the complexity of those systems into a human-understandable API is inherently dangerous: values are complex, the system is complex, their interaction will inevitably be complex, so any API simple enough for humans will inevitably miss things. So the only safe option which can scale to complex systems is to make sure the “tools” have their own models of human values, and use those models to check the safety of their outputs… which brings us right back to alignment.
Simple mechanisms like always displaying an estimated probability that I’ll regret asking a question would probably help, but I’m mainly worried about the unknown unknowns, not the known unknowns. That’s part of what I mean when I talk about marginal improvements vs closing the bulk of the gap—the unknown unknowns are the bulk of the gap.
(I could see tools helping in a do-the-same-things-but-faster sort of way, and human-mimicking approaches in particular are potentially helpful there. On the other hand, if we’re doing the same things but faster, it’s not clear that that scenario really favors alignment research over the Leeroy Jenkins of the world.)
In some sense I think the argument for paternalism is self-refuting: the argument is essentially that humans can’t be trusted, but I’m not sure the total amount of responsibility we’re assigning to humans has changed—if the first system is to be very paternalistic, that puts an even greater weight of responsibility on the shoulders of its designers to get it right.
This in particular I think is a strong argument, and the die-rolls argument is my main counterargument.
We can indeed partially avoid the die-rolls issue by only using the system a limited number of times—e.g. to design another system. That said, in order for the first system to actually add value here, it has to do some reasoning which is too complex for humans—which brings back the problem from earlier, about the inherent danger of collapsing complex values and systems into a simple API. We’d be rolling the dice twice—once in designing the first system, once in using the first system to design the second—and that second die-roll in particular has a lot of unknown unknowns packed into it.
Let’s make the later AIs corrigible then. Perhaps our initial AI can give us both a corrigibility oracle and a values oracle. (Or later AIs could use some other approach to corrigibility.)
I have yet to see a convincing argument that corrigibility is any easier than alignment itself. It seems to suffer from the same basic problem: the concept of “corrigibility” has a lot of hidden complexity, especially when it interacts with embeddedness. To the extent that we’re relying on corrigibility, I’d ideally like it to improve with capabilities, in the same way and for the same reasons as I’d like alignment to improve with capabilities. Do you know of an argument that it’s easier?
If we had that—not necessarily a full model of human values, just a formalization which we were confident could represent them—then that would immediately open the gates to analysis, to value learning, to uncertainty over values, etc.
Do you have in mind a specific aspect of human values that couldn’t be represented using, say, the reward function of a reinforcement learning agent AI?
On the tools side, I assume the tools will be reasoning about systems/problems which humans can’t understand—that’s the main value prop in the first place. Trying to collapse the complexity of those systems into a human-understandable API is inherently dangerous: values are complex, the system is complex, their interaction will inevitably be complex, so any API simple enough for humans will inevitably miss things. So the only safe option which can scale to complex systems is to make sure the “tools” have their own models of human values, and use those models to check the safety of their outputs… which brings us right back to alignment.
There’s an aspect of defense-in-depth here. If your tool’s model of human values is slightly imperfect, that doesn’t necessarily fail hard the way an agent with a model of human values that’s slightly imperfect does.
BTW, let’s talk about the “Research Assistant” story here. See more discussion here. (The problems brought up in that thread seem pretty solvable to me.)
Simple mechanisms like always displaying an estimated probability that I’ll regret asking a question would probably help, but I’m mainly worried about the unknown unknowns, not the known unknowns. That’s part of what I mean when I talk about marginal improvements vs closing the bulk of the gap—the unknown unknowns are the bulk of the gap.
That’s why you need a tool… so it can tell you the unknown unknowns you’re missing, and how to solve them. We’d rather have a single die roll, on creating a good tool, than have a separate die roll for every one of those unknown unknowns, wouldn’t we? ;-) Shouldn’t we aim for a fairly minimalist, non-paternalistic tool where unknown unknowns are relatively unlikely to become load-bearing? All we need to do is figure out the unknown unknowns that are load-bearing in the Research Assistant scenario, then the assistant can help us with the rest of the unknown unknowns.
it has to do some reasoning which is too complex for humans—which brings back the problem from earlier, about the inherent danger of collapsing complex values and systems into a simple API.
If solving FAI necessarily involves reasoning about things which are beyond humans (which seems to be what you’re getting at with the “unknown unknowns” stuff), what is the alternative?
I have yet to see a convincing argument that corrigibility is any easier than alignment itself. It seems to suffer from the same basic problem: the concept of “corrigibility” has a lot of hidden complexity, especially when it interacts with embeddedness. To the extent that we’re relying on corrigibility, I’d ideally like it to improve with capabilities, in the same way and for the same reasons as I’d like alignment to improve with capabilities. Do you know of an argument that it’s easier?
We were discussing a scenario where we had an OK solution to alignment, and you were saying that you didn’t want to get locked into a merely OK solution for all of eternity. I’m saying corrigibility can address that. Alignment is already solvable to an OK degree in this hypothetical, so I’m assuming corrigibility is solvable to an OK degree as well.
Corrigible AI should be able to improve its corrigibility with increased capabilities the same way it can improve its alignment with increased capabilities. You say “corrigibility” has a lot of hidden complexity. The more capable the system, the more hypotheses it can generate regarding complex phenomena, and the more likely those hypotheses are to be correct. There’s no reason we can’t make the system’s notion of corrigibility corrigible in the same way its values are corrigible. (BTW, I don’t think corrigibility even necessarily needs to be thought of as separate from alignment; you could think of both as being reflected in an agent’s reward function, say. But that’s a tangent.) And we can leverage capability increases by having the system explain various notions of corrigibility it’s discovered and how they differ, so we can figure out which notion(s) we want to use.
Do you have in mind a specific aspect of human values that couldn’t be represented using, say, the reward function of a reinforcement learning agent AI?
It’s not the function-representation that’s the problem, it’s the type-signature of the function. I don’t know what such a function would take in or what it would return. Even RL requires that we specify the input-output channels up-front.
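To make the type-signature point concrete, here is a minimal sketch (purely illustrative; the type names are my own placeholders, not a proposed formalism). In standard RL the reward function’s signature is fully pinned down before learning starts, whereas for human values even the input type is an open question:

```python
from typing import Any, Callable, List

# Standard RL pins down the reward function's type signature before learning
# even starts: it maps the agent's own observation (and action) to a scalar.
Observation = List[float]        # whatever the agent's sensors report
Action = int                     # an index into a fixed action set
RLReward = Callable[[Observation, Action], float]

def example_reward(obs: Observation, act: Action) -> float:
    """A perfectly well-typed RL reward: it scores the agent's *inputs*."""
    return obs[0] - 0.01 * act

# The claim above is that we can't yet write the analogous signature for human
# values. Values seem to be over states of the world itself, not over the
# agent's observations, so the input type below is a placeholder we don't know
# how to fill in (and even the scalar output is a guess).
WorldState = Any
HumanValues = Callable[[WorldState], float]
```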
All we need to do is figure out the unknown unknowns that are load-bearing in the Research Assistant scenario, then the assistant can help us with the rest of the unknown unknowns.
This translates in my head to “all we need to do is solve the main problems of alignment, and then we’ll have an assistant which can help us clean up any easy loose ends”.
More generally: I’m certainly open to the idea of AI, of one sort or another, helping to work out at least some of the problems of alignment. (Indeed, that’s very likely a component of any trajectory where alignment improves over time.) But I have yet to hear a convincing case that punting now actually makes long-run alignment more likely, or even that future tools will make creation of aligned AI easier/more likely relative to unaligned AI. What exactly is the claim here?
If solving FAI necessarily involves reasoning about things which are beyond humans (which seems to be what you’re getting at with the “unknown unknowns” stuff), what is the alternative?
I don’t think solving FAI involves reasoning about things beyond humans. I think the AIs themselves will need to reason about things beyond humans, and in particular will need to reason about complex safety problems on a day-to-day basis, but I don’t think that designing a friendly AI is too complex for humans.
Much of the point of AI is that we can design systems which can reason about things too complex for ourselves. Similarly, I expect we can design safe systems which can reason about safety problems too complex for ourselves.
Corrigible AI should be able to improve its corrigibility with increased capabilities the same way it can improve its alignment with increased capabilities.
What notion of “corrigible” are you using here? It sounds like it’s not MIRI’s “the AI won’t disable its own off-switch” notion.
This translates in my head to “all we need to do is solve the main problems of alignment, and then we’ll have an assistant which can help us clean up any easy loose ends”.
To clarify: do you think the problems brought up in these answers are the main problems of alignment? This claim seems a bit odd to me, because I don’t think those problems are highlighted in any of the major AI alignment research agenda papers. (Alternatively, if you feel like there are important omissions from those answers, I strongly encourage you to write your own answer!)
I did a presentation at the recent AI Safety Discussion Day on how to solve the problems in that thread. My proposed solutions don’t look much like anything that’s e.g. on Arbital, because the problems are different. I can share the slides if you want; PM me your Gmail address.
More generally: I’m certainly open to the idea of AI, of one sort or another, helping to work out at least some of the problems of alignment. (Indeed, that’s very likely a component of any trajectory where alignment improves over time.) But I have yet to hear a convincing case that punting now actually makes long-run alignment more likely, or even that future tools will make creation of aligned AI easier/more likely relative to unaligned AI. What exactly is the claim here?
Here’s an example of a tool that I would find helpful right now, that seems possible to make with current technology (and will get better as technology advances), and seems very low risk: Given a textual description of some FAI proposal (or proposal for solving some open problem within AI safety), highlight the contiguous passage of text within the voluminous archives of AF/LW/etc. that is most likely to represent a valid objection to this proposal. (EDIT: Or, given some AI safety problem, highlight the contiguous passage of text which is most likely to represent a solution.)
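For concreteness, a minimal sketch of one way to build this with off-the-shelf components (the model name and the cosine-similarity ranking are just placeholder choices, and plain similarity is only a crude proxy for “most likely to represent a valid objection”—a real version would presumably re-rank or fine-tune for that):

```python
# Illustrative sketch only: rank pre-chunked AF/LW passages against a proposal
# using an off-the-shelf sentence-embedding model. Model choice is arbitrary.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def most_relevant_passage(proposal: str, passages: list) -> str:
    """Return the archive passage whose embedding is closest to the proposal."""
    proposal_emb = model.encode(proposal, convert_to_tensor=True)
    passage_embs = model.encode(passages, convert_to_tensor=True)
    scores = util.cos_sim(proposal_emb, passage_embs)[0]
    return passages[int(scores.argmax())]
```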
Can you come up with improbable scenarios in which this sort of thing ends up being net harmful? Sure. But security is not binary. Just because there is some hypothetical path to harm doesn’t mean harm is likely.
Could this kind of approach be useful for unaligned AI as well? Sure. So begin work on it ASAP, keep it low profile, and restrict its use to alignment researchers in order to create maximum differential progress towards aligned AI.
Similarly, I expect we can design safe systems which can reason about safety problems too complex for ourselves.
I’m a bit confused about why you’re bringing up “safety problems too complex for ourselves”: based on the sentences that came before this one, it sounds like you don’t think there are any important safety problems like that?
What notion of “corrigible” are you using here? It sounds like it’s not MIRI’s “the AI won’t disable its own off-switch” notion.
I’m talking about the broad sense of “corrigible” described in e.g. the beginning of this post.
(BTW, I just want to clarify that we’re having two parallel discussions here: One discussion is about what we should be doing very early in our AI safety gameplan, e.g. creating the assistant I described that seems like it would be useful right now. Another discussion is about how to prevent a failure mode that could come about very late in our AI safety gameplan, where we have a sorta-aligned AI and we don’t want to lock ourselves into an only sorta-optimal universe for all eternity. I expect you realize this, I’m just stating it explicitly in order to make the discussion a bit easier to follow.)
To clarify: do you think the problems brought up in these answers are the main problems of alignment?
Mostly no. I’ve been trying to write a bit more about this topic lately; Alignment as Translation is the main source of my intuitions on core problems, and the fusion power generator scenario is an example of what that looks like in a GPT-like context (parts of your answer here are similar to that).
Using GPT-like systems to simulate alignment researchers’ writing is a probably-safer use-case, but it still runs into the core catch-22. Either:
It writes something we’d currently write, which means no major progress (since we don’t currently have solutions to the major problems and therefore can’t write down such solutions), or
It writes something we currently wouldn’t write, in which case it’s out-of-distribution and we have to worry about how it’s extrapolating us.
I generally expect the former to mostly occur by default; the latter would require some clever prompts.
I could imagine at least some extrapolation of progress being useful, but it still seems like the best way to make human-simulators more useful is to improve our own understanding, so that we’re more useful to simulate.
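To illustrate what I mean by “clever prompts”, something like the following toy example (the model name and API call are placeholders for whatever completion interface is available) deliberately frames the completion as a future write-up of a solved problem, which is exactly the kind of extrapolation-of-us I’d be worried about:

```python
# Toy illustration of an "extrapolate the researchers" prompt; model name and
# API call are placeholders, not a specific recommendation.
import openai

prompt = (
    "Alignment Forum, 2035. In this post I summarize the now-standard "
    "formalization of the type signature of human values, and explain how "
    "it resolved the old debates about utility functions:\n\n"
)

response = openai.Completion.create(
    engine="davinci",   # placeholder model
    prompt=prompt,
    max_tokens=300,
    temperature=0.8,
)
print(response.choices[0].text)
```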
Given a textual description of some FAI proposal (or proposal for solving some open problem within AI safety), highlight the contiguous passage of text within the voluminous archives of AF/LW/etc. that is most likely to represent a valid objection to this proposal.
This sounds like a great tool to have. It’s exactly the sort of thing which is probably marginally useful. It’s unlikely to help much on the big core problems; it wouldn’t be much use for identifying unknown unknowns which nobody has written about before. But it would very likely help disseminate ideas, and be net-positive in terms of impact.
I do think a lot of the things you’re suggesting would be valuable and worth doing, on the margin. They’re probably not sufficient to close the bulk of the safety gap without theoretical progress on the core problems, but they’re still useful.
I’m a bit confused about why you’re bringing up “safety problems too complex for ourselves”: based on the sentences that came before this one, it sounds like you don’t think there are any important safety problems like that?
The “safety problems too complex for ourselves” are things like the fusion power generator scenario—i.e. safety problems in specific situations or specific applications. The safety problems which I don’t think are too complex are the general versions, i.e. how to build a generally-aligned AI.
An analogy: finding shortest paths in a billion-vertex graph is far too complex for me. But writing a general-purpose path-finding algorithm to handle that problem is tractable. In the same way, identifying the novel safety problems of some new technology will sometimes be too complex for humans. But writing a general-purpose safety-reasoning algorithm (i.e. an aligned AI) is tractable, I expect.
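To compress the analogy into code: the general-purpose algorithm is a dozen lines, even though any particular billion-vertex instance is far beyond unaided human reasoning. (Standard textbook Dijkstra, included purely to illustrate the general-solution-vs-specific-instance point.)

```python
import heapq

def dijkstra(graph, source):
    """General-purpose shortest paths. `graph` maps each vertex to a list of
    (neighbor, weight) pairs; the same few lines work whether the graph has
    ten vertices or a billion (given enough memory and time)."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in graph.get(u, []):
            if d + w < dist.get(v, float("inf")):
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist
```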
I’m talking about the broad sense of “corrigible” described in e.g. the beginning of this post.
Ah ok, the suggestion makes sense now. That’s a good idea. It’s still punting a lot of problems until later, and humans would still be largely responsible for solving those problems later. But it could plausibly help with the core problems, without any obvious trade-off (assuming that the AI/oracle actually does end up pointed at corrigibility).
Mostly no. I’ve been trying to write a bit more about this topic lately; Alignment as Translation is the main source of my intuitions on core problems, and the fusion power generator scenario is an example of what that looks like in a GPT-like context (parts of your answer here are similar to that).
Well, I encourage you to come up with a specific way in which GPT-N will harm us by trying to write an AF post due to not having solved Alignment as Translation and add it as an answer in that thread. Given that we may be in an AI overhang, I’d like the answers to represent as broad a distribution of plausible harms as possible, because that thread might end up becoming very important & relevant very soon.