Trying to clarify here: do you think the problems brought up in these answers are the main problems of alignment?
Mostly no. I’ve been trying to write a bit more about this topic lately; Alignment as Translation is the main source of my intuitions on core problems, and the fusion power generator scenario is an example of what that looks like in a GPT-like context (parts of your answer here are similar to that).
Using GPT-like systems to simulate alignment researchers’ writing is a probably-safer use-case, but it still runs into the core catch-22. Either:
It writes something we’d currently write, which means no major progress (since we don’t currently have solutions to the major problems and therefore can’t write down such solutions), or
It writes something we currently wouldn’t write, in which case it’s out-of-distribution and we have to worry about how it’s extrapolating us.
I generally expect the former to mostly occur by default; the latter would require some clever prompts.
I could imagine at least some extrapolation of progress being useful, but it still seems like the best way to make human-simulators more useful is to improve our own understanding, so that we’re more useful to simulate.
Given a textual description of some FAI proposal (or proposal for solving some open problem within AI safety), highlight the contiguous passage of text within the voluminous archives of AF/LW/etc. that is most likely to represent a valid objection to this proposal.
This sounds like a great tool to have. It’s exactly the sort of thing which is probably marginally useful. It’s unlikely to help much on the big core problems; it wouldn’t be much use for identifying unknown unknowns which nobody has written about before. But it would very likely help disseminate ideas, and be net-positive in terms of impact.
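To make the "marginally useful tool" concrete, here is a minimal sketch of the retrieval step such a tool would need, assuming a plain TF-IDF similarity baseline (the function name and approach are illustrative, not an existing tool). Ranking passages by textual similarity is only a crude proxy; judging whether the top passage is actually a valid objection would still require something like GPT-N or a human on top of this step.

```python
# Hypothetical sketch of the retrieval step only: rank archived passages
# (e.g. paragraph-level chunks of AF/LW posts) by TF-IDF cosine similarity
# to a proposal description.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def most_relevant_passage(proposal: str, passages: list[str]) -> str:
    """Return the archived passage most textually similar to the proposal."""
    vectorizer = TfidfVectorizer(stop_words="english")
    # Fit on archive + query together so they share one vocabulary.
    matrix = vectorizer.fit_transform(passages + [proposal])
    archive_vectors, query_vector = matrix[:-1], matrix[-1]
    scores = cosine_similarity(query_vector, archive_vectors)[0]
    return passages[scores.argmax()]
```

In practice the passages would be chunked posts from the archives, and one would probably surface the top few hits alongside the proposal rather than a single passage.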
I do think a lot of the things you’re suggesting would be valuable and worth doing, on the margin. They’re probably not sufficient to close the bulk of the safety gap without theoretical progress on the core problems, but they’re still useful.
I’m a bit confused about why you’re bringing up “safety problems too complex for ourselves”: based on the sentences just before this one, it sounds like you don’t think there are any important safety problems like that?
The “safety problems too complex for ourselves” are things like the fusion power generator scenario—i.e. safety problems in specific situations or specific applications. The safety problems which I don’t think are too complex are the general versions, i.e. how to build a generally-aligned AI.
An analogy: finding shortest paths in a billion-vertex graph is far too complex for me. But writing a general-purpose path-finding algorithm to handle that problem is tractable. In the same way, identifying the novel safety problems of some new technology will sometimes be too complex for humans. But writing a general-purpose safety-reasoning algorithm (i.e. an aligned AI) is tractable, I expect.
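To make the analogy concrete: this is the sort of thing I mean by a "general-purpose path-finding algorithm". A textbook Dijkstra implementation is a few lines, even though manually tracing shortest paths through a billion-vertex graph is hopeless. (This is just an illustration of the analogy, not a claim about alignment itself.)

```python
# Standard Dijkstra's algorithm: a short, general-purpose routine that
# handles graphs far too large for a human to reason through by hand.
import heapq

def shortest_distances(graph, source):
    """graph: {vertex: [(neighbor, edge_weight), ...]}; returns {vertex: distance}."""
    dist = {source: 0}
    frontier = [(0, source)]  # (distance so far, vertex)
    while frontier:
        d, u = heapq.heappop(frontier)
        if d > dist.get(u, float("inf")):
            continue  # stale queue entry
        for v, w in graph.get(u, []):
            new_d = d + w
            if new_d < dist.get(v, float("inf")):
                dist[v] = new_d
                heapq.heappush(frontier, (new_d, v))
    return dist
```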
I’m talking about the broad sense of “corrigible” described in e.g. the beginning of this post.
Ah ok, the suggestion makes sense now. That’s a good idea. It’s still punting a lot of problems until later, and humans would still be largely responsible for solving those problems later. But it could plausibly help with the core problems, without any obvious trade-off (assuming that the AI/oracle actually does end up pointed at corrigibility).
Mostly no. I’ve been trying to write a bit more about this topic lately; Alignment as Translation is the main source of my intuitions on core problems, and the fusion power generator scenario is an example of what that looks like in a GPT-like context (parts of your answer here are similar to that).
Well, I encourage you to come up with a specific way in which GPT-N will harm us by trying to write an AF post due to not having solved Alignment as Translation, and to add it as an answer in that thread. Given that we may be in an AI overhang, I’d like the answers to represent as broad a distribution of plausible harms as possible, because that thread might end up becoming very important & relevant very soon.