To the extent that alignment research involves solving philosophical problems, it seems that in this approach we will also need to automate philosophy, otherwise alignment research will become bottlenecked on those problems (i.e., on human philosophers trying to solve those problems while the world passes them by). Do you envision automating philosophy (and are you optimistic about this), or do you see some other way of getting around this issue?
It worries me to depend on AI to do philosophy without understanding what “philosophical reasoning” or “philosophical progress” actually consists of, i.e., without having solved metaphilosophy. I guess concretely there are two ways that automating philosophy could fail. 1) We just can’t get AI to do sufficiently good philosophy (in the relevant time frame), and it turns out to be a waste of time for human philosophers to help train AI philosophy (e.g. by evaluating their outputs and providing feedback) or to try to use the AIs as assistants. 2) Using AI changes the trajectory of philosophical progress in a bad way (due to Goodhart, adversarial inputs, etc.), so that we end up accepting conclusions different from what we would eventually have decided on our own, or just wrong conclusions. It seems to me that humans are very prone to accepting bad philosophical ideas, but over the long run also have some mysterious way of collectively making philosophical progress. AI could exacerbate the former and disrupt the latter.
Curious if you’ve thought about this and what your own conclusions are. For example, does OpenAI have any backup plans in case 1 turns out to be the case, or ideas for determining how likely 2 is or how to make it less likely?
Also, aside from this, what do you think are the biggest risks with OpenAI’s alignment approach? What’s your assessment of OpenAI leadership’s understanding of these risks?
Insofar as philosophical progress is required, my optimism for AI helping on this is lower than for (more) technical research, since in philosophy evaluation is often much harder, and I’m not sure that it’s always easier than generation. It’s much easier to picture a highly charismatic AI-written manifesto that looks very persuasive and is very difficult to refute than technical claims about math, algorithms, or empirical data that are similarly persuasive yet hard to falsify.
However, I’m skeptical that the list of novel philosophical problems we actually need to solve to prevent the most serious misalignment risk will be that long. For example, a lot of problems in rationality + decision theory + game theory I’d count more as model capabilities, and the moral patienthood questions you can punt on for a while from the longtermist point of view.
> Insofar as philosophical progress is required, my optimism for AI helping on this is lower than for (more) technical research, since in philosophy evaluation is often much harder, and I’m not sure that it’s always easier than generation. It’s much easier to picture a highly charismatic AI-written manifesto that looks very persuasive and is very difficult to refute than technical claims about math, algorithms, or empirical data that are similarly persuasive yet hard to falsify.
Given this, what is your current plan around AI and philosophy? I guess train AI to do various kinds of reasoning required for automated alignment research, using RRM, starting with the easier kinds and working your way up to philosophical reasoning? What are your current credences that A) this ends up working well, B) it clearly doesn’t work, or C) it appears to work well but actually doesn’t (we end up getting fooled by persuasive but wrong arguments)? What’s the backup plan in case of B?
I’m concerned about C (for obvious reasons), and also that if B is true, then by the time we learn it we’ll have AI that is human-level or better on every capability except philosophy, which seems a terrible position to be in. Given your quoted comment above, it seems that you can’t have a very high credence for A, but then how do you justify being overall optimistic?
> However, I’m skeptical that the list of novel philosophical problems we actually need to solve to prevent the most serious misalignment risk will be that long.
It could be a short list and that’s still a big problem, right?
> For example, a lot of problems in rationality + decision theory + game theory I’d count more as model capabilities
I may be missing your point here, but I’m not sure why this matters. If we get some of these questions wrong, we may end up throwing away large fractions of the potential value of the universe, so it seems important to solve them regardless of whether they’re categorized under “alignment” or “capabilities”. It’s possible that we could safely punt these questions for a while but also possible that we end up losing a lot of value that way (if we have to make highly consequential decisions/commitments/bargains during this time).
> the moral patienthood questions you can punt on for a while from the longtermist point of view
But we can’t be sure that the longtermist view is correct, so in the interest of not risking committing a moral atrocity (in case a more neartermist view is correct) we still have to answer these questions. Answering these questions also seems urgent from the perspective that AIs (or humans for that matter) will pretty soon start making persuasive arguments that AIs deserve moral patienthood and hence legal rights, and “let’s ignore these arguments for now because of longtermism” probably won’t be a socially acceptable answer.
> I guess train AI to do various kinds of reasoning required for automated alignment research, using RRM, starting with the easier kinds and working your way up to philosophical reasoning?
Thinking about this a bit more, a better plan would be to train AI to do all the kinds of reasoning required for automated alignment research in parallel, so that you can monitor which kinds of reasoning are harder for the AI to learn than the others, and get more of an early warning if it looks like one of them would cause the project to fail. Assuming your plan looks more like this, it would make me feel a little better about B, although I’d still be concerned about C.
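To make the monitoring idea concrete, here is a toy sketch (not any actual training setup; all category names and scores are hypothetical): track held-out evaluation scores per reasoning category across checkpoints, and flag categories that trail far behind the best one as potential bottlenecks.

```python
# Toy sketch: flag reasoning categories whose latest eval score lags
# far behind the best category, as an early warning that the project
# might be bottlenecked on them (e.g. on philosophical reasoning).
# All names and numbers below are hypothetical illustrations.

def lagging_categories(history, margin=0.2):
    """history maps category -> list of eval scores, one per checkpoint.

    Returns (sorted) categories whose latest score trails the best
    latest score by more than `margin`.
    """
    latest = {cat: scores[-1] for cat, scores in history.items()}
    best = max(latest.values())
    return sorted(cat for cat, score in latest.items() if best - score > margin)

# Hypothetical per-checkpoint scores for each kind of reasoning:
history = {
    "math": [0.4, 0.6, 0.8],
    "empirical": [0.35, 0.55, 0.75],
    "philosophy": [0.2, 0.25, 0.3],  # barely improving
}

print(lagging_categories(history))  # ['philosophy'] under these numbers
```

Of course, the hard part (per concern C) is that the philosophy evaluations themselves may be unreliable; this only helps with noticing concern B early.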
More generally, I tend to think that getting automated philosophy right is probably one of the hardest parts of automating alignment research (as well as the overall project of making the AI transition go well), so it makes me worried when alignment researchers don’t talk specifically about philosophy when explaining why they’re optimistic, and makes me want to ask/remind them about it. Hopefully that seems understandable instead of annoying.
What are the key philosophical problems you believe we need to solve for alignment?
I guess it depends on the specific alignment approach being taken, such as whether you’re trying to build a sovereign or an assistant. Assuming the latter, I’ll list some philosophical problems that seem generally relevant:
- metaphilosophy
  - How to solve new philosophical problems relevant to alignment as they come up?
  - How to help users when they ask the AI to attempt philosophical progress?
  - How to help defend the user against bad philosophical ideas (whether in the form of virulent memes, or ideas intentionally optimized by other AIs/agents to manipulate the user)?
  - How to enhance, or at least not disrupt, our collective ability to make philosophical progress?
- metaethics
  - Should the AI always defer to the user or to OpenAI on ethical questions?
  - If not, or if the user asks it to, how can/should the AI try to make ethical determinations?
- rationality
  - How should the AI try to improve its own thinking?
  - How to help the user be more rational (if they so request)?
- normativity
  - How should the AI reason about “should” questions in general?
- normative and applied ethics
  - What kinds of user requests should the AI refuse to fulfill?
  - What does it mean to help the user when their goals/values are confused or unclear?
  - When is it ok to let OpenAI’s interests override the user’s?
- philosophy of mind
  - Which computations are conscious or constitute moral patients?
  - What exactly constitutes pain or suffering (which the AI should perhaps avoid helping the user create)?
  - How to avoid “mind crimes” within the AI’s own cognition/computation?
- decision theory / game theory / bargaining
  - How to help the user bargain with other agents?
  - How to avoid (and help the user avoid) being exploited by others (including distant superintelligences)?
See also this list, which I wrote a while ago. I wrote the above without first reviewing that post (to try to generate a fresh perspective).