This translates in my head to “all we need to do is solve the main problems of alignment, and then we’ll have an assistant which can help us clean up any easy loose ends”.
To try to clarify here: do you think the problems brought up in these answers are the main problems of alignment? This claim seems a bit odd to me because I don’t think those problems are highlighted in any of the major AI alignment research agenda papers. (Alternatively, if you feel like there are important omissions from those answers, I strongly encourage you to write your own answer!)
I did a presentation at the recent AI Safety Discussion Day on how to solve the problems in that thread. My proposed solutions don’t look much like anything that’s e.g. on Arbital, because the problems are different. I can share the slides if you want; PM me your Gmail address.
More generally: I’m certainly open to the idea of AI, of one sort or another, helping to work out at least some of the problems of alignment. (Indeed, that’s very likely a component of any trajectory where alignment improves over time.) But I have yet to hear a convincing case that punting now actually makes long-run alignment more likely, or even that future tools will make creation of aligned AI easier/more likely relative to unaligned AI. What exactly is the claim here?
Here’s an example of a tool that I would find helpful right now, that seems possible to make with current technology (and will get better as technology advances), and seems very low risk: Given a textual description of some FAI proposal (or proposal for solving some open problem within AI safety), highlight the contiguous passage of text within the voluminous archives of AF/LW/etc. that is most likely to represent a valid objection to this proposal. (EDIT: Or, given some AI safety problem, highlight the contiguous passage of text which is most likely to represent a solution.)
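(To make this concrete, here’s a minimal sketch of how such a tool might work with current technology, assuming the AF/LW archives have already been scraped into a list of passage strings. The embedding model named below is just an illustrative off-the-shelf choice, and a real version would need chunking, deduplication, ranking of multiple candidates, etc.)

```python
# Minimal sketch of the retrieval tool described above. Assumes the AF/LW
# archives are already available as a list of passage strings; the model
# name is an illustrative choice, not a recommendation.
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # example embedding model

def most_relevant_passage(proposal, passages):
    """Return the archived passage whose embedding is most similar to the proposal.

    Note this only surfaces the *most similar* passage; judging whether it is
    actually a valid objection (or a solution) is still left to the human reader.
    """
    vecs = model.encode([proposal] + passages)                 # shape (n+1, d)
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)  # unit-normalize rows
    scores = vecs[1:] @ vecs[0]                                # cosine similarity to the proposal
    return passages[int(np.argmax(scores))]
```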
Can you come up with improbable scenarios in which this sort of thing ends up being net harmful? Sure. But security is not binary. Just because there is some hypothetical path to harm doesn’t mean harm is likely.
Could this kind of approach be useful for unaligned AI as well? Sure. So begin work on it ASAP, keep it low profile, and restrict its use to alignment researchers in order to create maximum differential progress towards aligned AI.
Similarly, I expect we can design safe systems which can reason about safety problems too complex for ourselves.
I’m a bit confused why you’re bringing up “safety problems too complex for ourselves” because it sounds like you don’t think there are any important safety problems like that, based on the sentences that came before this one?
What notion of “corrigible” are you using here? It sounds like it’s not MIRI’s “the AI won’t disable its own off-switch” notion.
I’m talking about the broad sense of “corrigible” described in e.g. the beginning of this post.
(BTW, I just want to clarify that we’re having two parallel discussions here: One discussion is about what we should be doing very early in our AI safety gameplan, e.g. creating the assistant I described that seems like it would be useful right now. Another discussion is about how to prevent a failure mode that could come about very late in our AI safety gameplan, where we have a sorta-aligned AI and we don’t want to lock ourselves into an only sorta-optimal universe for all eternity. I expect you realize this, I’m just stating it explicitly in order to make the discussion a bit easier to follow.)
To try to clarify here: do you think the problems brought up in these answers are the main problems of alignment?
Mostly no. I’ve been trying to write a bit more about this topic lately; Alignment as Translation is the main source of my intuitions on core problems, and the fusion power generator scenario is an example of what that looks like in a GPT-like context (parts of your answer here are similar to that).
Using GPT-like systems to simulate alignment researchers’ writing is a probably-safer use-case, but it still runs into the core catch-22. Either:
It writes something we’d currently write, which means no major progress (since we don’t currently have solutions to the major problems and therefore can’t write down such solutions), or
It writes something we currently wouldn’t write, in which case it’s out-of-distribution and we have to worry about how it’s extrapolating us.
I generally expect the former to mostly occur by default; the latter would require some clever prompts.
I could imagine at least some extrapolation of progress being useful, but it still seems like the best way to make human-simulators more useful is to improve our own understanding, so that we’re more useful to simulate.
Given a textual description of some FAI proposal (or proposal for solving some open problem within AI safety), highlight the contiguous passage of text within the voluminous archives of AF/LW/etc. that is most likely to represent a valid objection to this proposal.
This sounds like a great tool to have. It’s exactly the sort of thing which is probably marginally useful. It’s unlikely to help much on the big core problems; it wouldn’t be much use for identifying unknown unknowns which nobody has written about before. But it would very likely help disseminate ideas, and be net-positive in terms of impact.
I do think a lot of the things you’re suggesting would be valuable and worth doing, on the margin. They’re probably not sufficient to close the bulk of the safety gap without theoretical progress on the core problems, but they’re still useful.
I’m a bit confused why you’re bringing up “safety problems too complex for ourselves” because it sounds like you don’t think there are any important safety problems like that, based on the sentences that came before this one?
The “safety problems too complex for ourselves” are things like the fusion power generator scenario—i.e. safety problems in specific situations or specific applications. The safety problems which I don’t think are too complex are the general versions, i.e. how to build a generally-aligned AI.
An analogy: finding shortest paths in a billion-vertex graph is far too complex for me. But writing a general-purpose path-finding algorithm to handle that problem is tractable. In the same way, identifying the novel safety problems of some new technology will sometimes be too complex for humans. But writing a general-purpose safety-reasoning algorithm (i.e. an aligned AI) is tractable, I expect.
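(To make the analogy concrete: the “general-purpose path-finding algorithm” is just a few lines, e.g. ordinary Dijkstra, even though tracing any particular billion-vertex instance by hand is hopeless. This is a standard textbook sketch, nothing alignment-specific.)

```python
import heapq

def shortest_paths(graph, source):
    """Dijkstra's algorithm: distances from `source` to every reachable vertex.

    `graph` maps each vertex to a list of (neighbor, weight) pairs with
    non-negative weights. The same few lines work whether the graph has ten
    vertices or a billion (given enough memory and time) -- the generality
    is the point of the analogy.
    """
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry; a shorter path to u was already found
        for v, w in graph.get(u, []):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist
```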
I’m talking about the broad sense of “corrigible” described in e.g. the beginning of this post.
Ah ok, the suggestion makes sense now. That’s a good idea. It’s still punting a lot of problems until later, and humans would still be largely responsible for solving those problems later. But it could plausibly help with the core problems, without any obvious trade-off (assuming that the AI/oracle actually does end up pointed at corrigibility).
Mostly no. I’ve been trying to write a bit more about this topic lately; Alignment as Translation is the main source of my intuitions on core problems, and the fusion power generator scenario is an example of what that looks like in a GPT-like context (parts of your answer here are similar to that).
Well, I encourage you to come up with a specific way in which GPT-N will harm us by trying to write an AF post due to not having solved Alignment as Translation and add it as an answer in that thread. Given that we may be in an AI overhang, I’d like the answers to represent as broad a distribution of plausible harms as possible, because that thread might end up becoming very important & relevant very soon.