> Insofar that philosophical progress is required, my optimism for AI helping on this is lower than for (more) technical research since in philosophy evaluation is often much harder and I’m not sure that it’s always easier than generation. You can much more easily picture a highly charismatic AI-written manifesto that looks very persuasive and is very difficult to refute than it is to make technical claims about math, algorithms, or empirical data that are persuasive and hard to falsify.
>
> However, I’m skeptical that the list of novel philosophical problems we actually need to solve to prevent the most serious misalignment risk will actually be that long. For example, a lot of problems in rationality + decision theory + game theory I’d count more as model capabilities and the moral patienthood questions you can punt on for a while from the longtermist point of view.
Given this, what is your current plan around AI and philosophy? I guess train AI to do various kinds of reasoning required for automated alignment research, using RRM, starting with the easier kinds and working your way up to philosophical reasoning? What are your current credences for A) this ends up working well, B) it clearly doesn’t work, and C) it appears to work well but actually doesn’t (we end up getting fooled by persuasive but wrong arguments)? What’s the backup plan in case of B?
I’m concerned about C (for obvious reasons), and also that if B is true, by the time we learn that we’ll have AI that is human-level or better at every capability except philosophy, which seems like a terrible position to be in. Given your quoted comment above, it seems that you can’t have a very high credence for A, but then how do you justify being overall optimistic?
> However, I’m skeptical that the list of novel philosophical problems we actually need to solve to prevent the most serious misalignment risk will actually be that long.
It could be a short list and that’s still a big problem, right?
> For example, a lot of problems in rationality + decision theory + game theory I’d count more as model capabilities
I may be missing your point here, but I’m not sure why this matters. If we get some of these questions wrong, we may end up throwing away large fractions of the potential value of the universe, so it seems important to solve them regardless of whether they’re categorized under “alignment” or “capabilities”. It’s possible that we could safely punt on these questions for a while, but it’s also possible that we’d end up losing a lot of value that way (if we have to make highly consequential decisions/commitments/bargains during this time).
> the moral patienthood questions you can punt on for a while from the longtermist point of view
But we can’t be sure that the longtermist view is correct, so in the interest of not risking committing a moral atrocity (in case a more neartermist view is correct) we still have to answer these questions. Answering these questions also seems urgent from the perspective that AIs (or humans for that matter) will pretty soon start making persuasive arguments that AIs deserve moral patienthood and hence legal rights, and “let’s ignore these arguments for now because of longtermism” probably won’t be a socially acceptable answer.
> I guess train AI to do various kinds of reasoning required for automated alignment research, using RRM, starting with the easier kinds and working your way up to philosophical reasoning?
Thinking about this a bit more, a better plan would be to train AI to do all the kinds of reasoning required for automated alignment research in parallel, so that you can monitor which kinds of reasoning are harder for the AI to learn than others and get more of an early warning if it looks like that would cause the project to fail. Assuming your plan looks more like this, it would make me feel a little better about B, although I’d still be concerned about C.
More generally, I tend to think that getting automated philosophy right is probably one of the hardest parts of automating alignment research (as well as of the overall project of making the AI transition go well), so it worries me when alignment researchers don’t talk specifically about philosophy when explaining why they’re optimistic, and it makes me want to ask/remind them about it. Hopefully that comes across as understandable rather than annoying.