For the last approach (corrigibility) to work, besides overseers/users who care about eventually getting philosophy right, it seems like we also need solutions to (intentional and unintentional) AI-mediated value corruption. (This part seems to be at least as much about AIs as about humans.) I don’t think I’ve seen anyone sketch out a plausible solution to these two problems yet (that doesn’t require solving hard philosophical problems like metaphilosophy). Do you agree? If not why not, and if yes why are you optimistic about that approach?
I’m super unsure about the intentional case, and agree that I want to see more work on that front, but it feels like a particular problem that can be solved with something like strategy/policy work. Put another way, intentional value corruption seems like a non-central example of problems that arise from philosophical difficulty. I agree that corrigibility + good overseers does not clearly solve it.
For the unintentional case, I think that overseers who care about getting philosophy right are going to think about value drift, because many of us are currently thinking about it. It seems like as long as the overseers make this apparent to the AI system and are sufficiently risk-averse, a corrigible AI system would take care not to corrupt their values. (The AI system might fail at this, but this doesn’t seem that likely to me, and it feels very hard to make progress on that particular point without more details on how the AI system works.)
I do think that we want to think about how to ensure that there are overseers who care about getting the questions right, who know about value drift, who will be sufficiently risk-averse, etc.
I’m super unsure about the intentional case, and agree that I want to see more work on that front, but it feels like a particular problem that can be solved with something like strategy/policy work.
What kind of strategy/policy work do you have in mind?
It seems like as long as the overseers make this apparent to the AI system and are sufficiently risk-averse, a corrigible AI system would take care not to corrupt their values.
Don’t we usually assume that the AI is ultimately corrigible to the user or otherwise has to cater to the user’s demands, because of competition between different AI providers? In that scenario, the end user also has to care about getting philosophy correct and being risk-averse for things to work out well, right? Or are you imagining some kind of monopoly or oligopoly situation where the AI providers all agree to be paternalistic and keep certain kinds of choices and technologies away from users? If so, how do you prevent AI tech from leaking out (ETA: or being reinvented) and enabling smaller actors to satisfy users’ risky demands? (ETA: Maybe you’re thinking of a scenario that’s more like 4 in my list?)
Another issue is that if AIs are not corrigible to end users but to overseers or their companies, that puts the overseers or companies in positions of tremendous power, which would be corrupting in its own way. I think the risk-averse thing to do would be to not put anyone in such situations, but it’s unclear how that can be accomplished (without other downsides). It seems that in general one could want to be risk-averse but not know how, so just having people be risk-averse doesn’t seem enough to ensure safety.
Yet another issue is that in a fast-moving world, a corrigible AI might need to query the overseer or user about lots of things that it’s unsure about. But it’s unclear what it’s supposed to do if such queries can themselves corrupt the overseer or user. Again, just being risk-averse doesn’t seem to be enough, and I don’t see a good solution within the corrigibility approach that doesn’t involve solving hard philosophical problems.
What kind of strategy/policy work do you have in mind?
Assessing the incentives for whether or not people will try to intentionally corrupt values, as well as figuring out how to change those incentives if they exist. I don’t know exactly, my point was more that this seems like an incentive problem. How would you attack this from a technical angle—do you have to handcuff the AI to prevent it from ever corrupting values?
Don’t we usually assume that the AI is ultimately corrigible to the user or otherwise has to cater to the user’s demands, because of competition between different AI providers? In that scenario, the end user also has to care about getting philosophy correct and being risk-averse for things to work out well, right? Or are you imagining some kind of monopoly or oligopoly situation where the AI providers all agree to be paternalistic and keep certain kinds of choices and technologies away from users? If so, how do you prevent AI tech from leaking out (ETA: or being reinvented) and enabling smaller actors to satisfy users’ risky demands? (ETA: Maybe you’re thinking of a scenario that’s more like 4 in my list?)
Yes, AI systems sold to end users would be corrigible to them, but I’m hoping that most of the power is concentrated with the overseers. End users could certainly hurt themselves, but broader governance would prevent them from significantly harming everyone else. Maybe you’re worried about end users having their values corrupted and then, through democracy, preventing us from getting most of the value? But even without value corruption I’d be quite afraid of end-user-defined democracy + powerful AI systems, and I assume you’d be too, so value corruption doesn’t seem to be the main issue.
Another issue is that if AIs are not corrigible to end users but to overseers or their companies, that puts the overseers or companies in positions of tremendous power, which would be corrupting in its own way.
Agreed that this is a problem.
It seems that in general one could want to be risk-averse but not know how, so just having people be risk-averse doesn’t seem enough to ensure safety.
[...] it’s unclear what it’s supposed to do if such queries can themselves corrupt the overseer or user. [...]
In all of these cases, it seems like the problem is independent of AI. For risk aversion, if you wanted to solve it now, presumably you would try to figure out how to be risk-averse. But you could also do this with the assistance of an AI system. Perhaps the AI system does something risky while it is helping you figure out risk aversion? This doesn’t feel very likely to me.
For the second one, presumably the queries would also corrupt the human if the human thought of them? If you’d like to solve this problem by creating a theory of value corruption and using that to decide whether queries were going to corrupt values, couldn’t you do that with the assistance of the AI, and it waits on the potentially corrupting queries until that theory is complete?
For Alex’s point, if there are risks during the period that an AI is trying to become metaphilosophically competent that can’t be ignored, why aren’t there similar risks right now that can’t be ignored?
(These could all be arguments that we’re doomed and there’s no hope, but they don’t seem to be arguments that we should differentially be putting current effort into them.)
I don’t know exactly, my point was more that this seems like an incentive problem.
It doesn’t have to be an either-or thing, and we could try to attack from both angles at once.
How would you attack this from a technical angle—do you have to handcuff the AI to prevent it from ever corrupting values?
The two approaches to this problem from the technical side that seem most promising to me are: A) solve metaphilosophy well enough that the AI can distinguish between good arguments and merely persuasive arguments and B) use my proposed hybrid approach to recover from corruption after the fact. These would fall under 2 and 3 in terms of the list in the OP.
(These could all be arguments that we’re doomed and there’s no hope, but they don’t seem to be arguments that we should differentially be putting current effort into them.)
These were meant to be arguments that approach 5 (corrigibility) is “doomed”, and I gave them as a reply to your optimism about approach 5, with the implication that perhaps we should put more effort into some of the other approaches. Of course these arguments aren’t watertight, so I hope there could be some creative technical ways to get around them within approach 5 too, but your statement “I am most optimistic that the last approach will ‘just work’” didn’t seem right to me and I wanted to point that out before it went into your newsletter.
These were meant to be arguments that approach 5 (corrigibility) is “doomed”
Aren’t they also equally powerful arguments that approaches 1-3 are doomed? I could see approach 4 as getting around the problem, though I’d hope that approach 4 could be subsumed under approach 5.
I agree that they are arguments against the statement “I am most optimistic that the last approach will ‘just work’”. Would you agree with “The last approach seems to be the most promising to work on”?
Aren’t they also equally powerful arguments that approaches 1-3 are doomed?
I think no, because using either metaphilosophy or the hybrid approach involving idealized humans, an AI could potentially undo any corruption that happens to the user after it becomes powerful enough (i.e., by using superhuman persuasion or some other method).
Would you agree with “The last approach seems to be the most promising to work on”?
Maybe come back to this after we settle the above question?
I think no, because using either metaphilosophy or the hybrid approach involving idealized humans, an AI could potentially undo any corruption that happens to the user after it becomes powerful enough (i.e., by using superhuman persuasion or some other method).
Couldn’t the overseer and the corrigible AI together attempt to solve metaphilosophy / use the hybrid approach if that was most promising? (And to the extent that we could solve metaphilosophy / use the hybrid approach now, it should only be easier once we have a corrigible AI.)
Maybe come back to this after we settle the above question?
Couldn’t the overseer and the corrigible AI together attempt to solve metaphilosophy / use the hybrid approach if that was most promising? (And to the extent that we could solve metaphilosophy / use the hybrid approach now, it should only be easier once we have a corrigible AI.)
Solving metaphilosophy is itself a philosophical problem, so if we haven’t made much progress on metaphilosophy by the time we get human-level AI, AI probably won’t be able to help much with solving metaphilosophy (especially relative to accelerating technological progress).
Implementing the hybrid approach may be more of a technological problem but may still involve hard philosophical problems so it seems like a good idea to look more into it now to determine if that is the case and how feasible it looks overall (and hence how “doomed” approach 5 is, if approach 5 depends on implementing the hybrid approach at some point). Also it seems like a good idea to try to give the hybrid approach as much of a head start as possible, because any value corruption that occurs prior to corrigible AI switching to a hybrid design probably won’t get rolled back.
Maybe I should clarify that I’m not against people working on corrigibility, if they think that is especially promising or they have a comparative advantage for working on that. I mainly don’t want to see statements that are so strongly in favor of approach 5 as to discourage people from looking into the other approaches deeply enough to determine for themselves how promising those approaches are and whether they might be especially suited to working on those approaches. Does that seem reasonable to you?
Solving metaphilosophy is itself a philosophical problem, so if we haven’t made much progress on metaphilosophy by the time we get human-level AI, AI probably won’t be able to help much with solving metaphilosophy (especially relative to accelerating technological progress).
I could interpret this in two ways:
Conditioned on metaphilosophy being hard to solve, AI won’t be able to help us with it.
Conditioned on us not trying to solve metaphilosophy, AI won’t be able to help us with it.
The first interpretation is independent of whether or not we work on metaphilosophy, so it can’t be an argument for working on metaphilosophy.
The second interpretation seems false to me, and not because I think there are many considerations that overall come out to make it false—I don’t see any arguments in favor of it. Perhaps one argument is that if we don’t try to solve metaphilosophy, then AI won’t infer that we care about it, and so won’t optimize for it. But that seems very weak, since we can just say that we do care, and that’s much stronger evidence. We can also point out that we didn’t try to solve the problem because it wasn’t the most urgent one at the time.
Implementing the hybrid approach may be more of a technological problem but may still involve hard philosophical problems so it seems like a good idea to look more into it now to determine if that is the case and how feasible it looks overall (and hence how “doomed” approach 5 is, if approach 5 depends on implementing the hybrid approach at some point).
This suggests to me that you think that corrigible AI can’t help us figure out hard philosophical problems or metaphilosophy? That would also explain the paragraph above. If so, that’s definitely a crux for me, and I’d like to see arguments for that.
I guess you could also make this argument if you think AI is going to accelerate technological progress relative to (meta)philosophical progress. I probably agreed with this in the past, but now that I’m thinking more about it I’m not sure I agree any more. I suspect I was interpreting this as “technological progress will be faster than (meta)philosophical progress” instead of the actually-relevant “the gap between technological progress and (meta)philosophical progress will grow faster than it would have without AI”. Do you have arguments for this latter operationalization?
Background: I generally think humans are pretty “good” at technological progress and pretty “bad” at (meta)philosophical progress, and I think AI will be similar. If anything, I might expect the gap between the two to decrease, since humans are “just barely” capable of (meta)philosophical progress (animals aren’t capable of it, whereas they are somewhat capable of technological progress), and so there might be more room to improve. But this is based on what I expect are extremely fragile and probably wrong intuitions.
Also it seems like a good idea to try to give the hybrid approach as much of a head start as possible, because any value corruption that occurs prior to corrigible AI switching to a hybrid design probably won’t get rolled back.
This is also dependent on the crux above.
Maybe I should clarify that I’m not against people working on corrigibility, if they think that is especially promising or they have a comparative advantage for working on that.
I didn’t get the impression that you were against people working on corrigibility. Similarly, I’m not strongly against people working on metaphilosophy. What I’d like to do here is clarify what about metaphilosophy is likely to be necessary before we build powerful AI systems.
Does that seem reasonable to you?
Given your beliefs, definitely. It’s reasonable by my beliefs too, though it’s not what I would do (obviously).
I suspect I was interpreting this as “technological progress will be faster than (meta)philosophical progress” instead of the actually-relevant “the gap between technological progress and (meta)philosophical progress will grow faster than it would have without AI”. Do you have arguments for this latter operationalization?
I thought from a previous comment that you already agree with the latter, but sure, I can give an argument. It’s basically that the most obvious way of using ML to accelerate philosophical progress seems risky (compared to just having humans do philosophical work) and no one has proposed a better method, so unless this problem is solved in a better way, it looks like we’d have to either accept a faster-growing gap between philosophical progress and technological progress, or incur extra risk from using ML to accelerate philosophical progress. See the section “Replicate the trajectory with ML?” of “Some Thoughts on Metaphilosophy” for more details.
Background: I generally think humans are pretty “good” at technological progress and pretty “bad” at (meta)philosophical progress, and I think AI will be similar.
Aside from the above argument, I think we could end up creating AIs whose ratio between philosophical ability and technical ability is worse than human, if AI designers simply spent more resources on improving technical ability and neglected philosophical ability in comparison (e.g., because there is higher market demand for technical ability). Considering how much money is currently being invested into making technological progress vs philosophical progress in the overall economy, wouldn’t you expect something similar when it comes to AI? (I guess this is more of an argument for overall pessimism rather than for favoring one approach over another, but I still wanted to point out that I don’t agree with your relative optimism here.)
I thought from a previous comment that you already agree with the latter
Yeah, that’s why I said “I probably agreed with this in the past”. I’m not sure whether my underlying models changed or whether I didn’t notice the contradiction in my beliefs at the time.
It’s basically that the most obvious way of using ML to accelerate philosophical progress seems risky
It feels like this is true for the vast majority of plausible technological progress as well? E.g. most scientific experiments / designed technologies require real-world experimentation, which means you get very little data, making it very hard to naively automate with ML. I could make a just-so story where philosophy has much more data (philosophy writing), that is relatively easy to access (a lot of it is on the Internet), and so will be easier to automate.
My actual reason for not seeing much of a difference is that (conditional on short timelines) I expect that the systems we develop will be very similar to humans in the profile of abilities they have, because it looks like we will develop them in a manner similar to how humans were “developed” (I’m imagining development paths that look like e.g. OpenAI Five, AlphaStar, GPT-2 as described at SlateStarCodex). So the zeroth-order prediction is that there won’t be a relative difference between technological and philosophical progress. A very sketchy first-order prediction based on “there is lots of easily accessible philosophy data” suggests that philosophical progress will be differentially advanced.
Yeah, I agree that that particular method of making philosophical progress is not going to work.
I guess this is more of an argument for overall pessimism rather than for favoring one approach over another, but I still wanted to point out that I don’t agree with your relative optimism here.
Yeah, that’s basically my response.
I don’t have good arguments for my optimism (and I did remove it from the newsletter opinion for that reason). Nonetheless, I am optimistic. One argument is that over the past few centuries it seems like philosophical progress has been making the world better faster than technological progress has been causing bad distributional shifts—but of course even if our ancestors’ values had been corrupted we would not see it that way, so this isn’t a very good argument.
It feels like this is true for the vast majority of plausible technological progress as well? E.g. most scientific experiments / designed technologies require real-world experimentation, which means you get very little data, making it very hard to naively automate with ML. I could make a just-so story where philosophy has much more data (philosophy writing), that is relatively easy to access (a lot of it is on the Internet), and so will be easier to automate.
On the scientific/technological side, you can also use scientific/engineering papers (which I’m guessing have to be at least an order of magnitude greater in volume than philosophy writing), plus you have access to ground truths in the form of experiments and real-world outcomes (as well as near-ground truths like simulation results), which have no counterpart in philosophy. My main point is that it seems a lot harder for technological progress to go “off the rails” due to having access to ground truths (even if that data is sparse), so we can push it much harder with ML.
My actual reason for not seeing much of a difference is that (conditional on short timelines) I expect that the systems we develop will be very similar to humans in the profile of abilities they have, because it looks like we will develop them in a manner similar to how humans were “developed”
I agree this could be a reason that things turn out well even if we don’t explicitly solve metaphilosophy or do something like my hybrid approach ahead of time. The way I would put it is that humans developed philosophical abilities for some mysterious reason that we don’t understand, so we can’t rule out AI developing philosophical abilities for the same reason. It feels pretty risky to rely on this though. If by the time we get human-level AI, this turns out not to be true, what are we going to do then? And even if we end up with AIs that appear to be able to help us with philosophy, without having solved metaphilosophy how would we know whether it’s actually helping or pushing us “off the rails”?
On the scientific/technological side, you can also use scientific/engineering papers (which I’m guessing have to be at least an order of magnitude greater in volume than philosophy writing)
This still seems like it is continuing the status quo (where we put more effort into technology relative to philosophy) rather than differentially benefitting technology.
My main point is that it seems a lot harder for technological progress to go “off the rails” due to having access to ground truths (even if that data is sparse) so we can push it much harder with ML.
Yeah, that seems right, to the extent that we want to use ML to “directly” work on technological / philosophical progress. To the extent that it has to factor through some more indirect method (e.g. through human reasoning as in iterated amplification) I think this becomes an argument to be pessimistic about solving metaphilosophy, but not that it will differentially benefit technological progress (or at least this depends on hard-to-agree-on intuitions).
I think there’s a strong argument to be made that you will have to go through some indirect method because there isn’t enough data to attack the problem directly.
(Fwiw, I’m also worried about the semi-supervised RL part of iterated amplification for the same reason.)
The way I would put it is that humans developed philosophical abilities for some mysterious reason that we don’t understand, so we can’t rule out AI developing philosophical abilities for the same reason. It feels pretty risky to rely on this though.
Yeah, I agree that this is a strong argument for your position.
BTW, Alex Zhu made a similar point in Acknowledging metaphilosophical competence may be insufficient for safe self-amplification.
Yeah, seems right.
This is also dependent on the crux above.
I didn’t get the impression that you were against people working on corrigibility. Similarly, I’m not strongly against people working on metaphilosophy. What I’d like to do here is clarify what about metaphilosophy is likely to be necessary before we build powerful AI systems.
Given your beliefs, definitely. It’s reasonable by my beliefs too, though it’s not what I would do (obviously).
I thought from a previous comment that you already agree with the latter, but sure, I can give an argument. It’s basically that the most obvious way of using ML to accelerate philosophical progress seems risky (compared to just having humans do philosophical work) and no one has proposed a better method, so unless this problem is solved in a better way, it looks like we’d have to either accept a faster growing gap between philosophical progress and technological progress, or incur extra risk from using ML to accelerate philosophical progress. See the section “Replicate the trajectory with ML?” of “Some Thoughts on Metaphilosophy” for more details.
Aside from the above argument, I think we could end up creating AIs whose ratio between philosophical ability and technical ability is worse than human, if AI designers simply spent more resources on improving technical ability and neglected philosophical ability in comparison (e.g., because there is higher market demand for technical ability). Considering how much money is currently being invested into making technological progress vs philosophical progress in the overall economy, wouldn’t you expect something similar when it comes to AI? (I guess this is more of an argument for overall pessimism rather than for favoring one approach over another, but I still wanted to point out that I don’t agree with your relative optimism here.)
Yeah, that’s why I said “I probably agreed with this in the past”. I’m not sure whether my underlying models changed or whether I didn’t notice the contradiction in my beliefs at the time.
It feels like this is true for the vast majority of plausible technological progress as well? E.g. most scientific experiments / designed technologies require real-world experimentation, which means you get very little data, making it very hard to naively automate with ML. I could make a just-so story where philosophy has much more data (philosophy writing) that is relatively easy to access (a lot of it is on the Internet), and so will be easier to automate.
My actual reason for not seeing much of a difference is that (conditional on short timelines) I expect that the systems we develop will be very similar to humans in the profile of abilities they have, because it looks like we will develop them in a manner similar to how humans were “developed” (I’m imagining development paths that look like e.g. OpenAI Five, AlphaStar, GPT-2 as described at SlateStarCodex). So the zeroth-order prediction is that there won’t be a relative difference between technological and philosophical progress. A very sketchy first-order prediction based on “there is lots of easily accessible philosophy data” suggests that philosophical progress will be differentially advanced.
Yeah, I agree that that particular method of making philosophical progress is not going to work.
Yeah, that’s basically my response.
I don’t have good arguments for my optimism (and I did remove it from the newsletter opinion for that reason). Nonetheless, I am optimistic. One argument is that over the past few centuries it seems like philosophical progress has been making the world better faster than technological progress has been causing bad distributional shifts—but of course even if our ancestors’ values had been corrupted we would not see it that way, so this isn’t a very good argument.
On the scientific/technological side, you can also use scientific/engineering papers (which I’m guessing have to be at least an order of magnitude greater in volume than philosophy writing), plus you have access to ground truths in the form of experiments and real world outcomes (as well as near-ground truths like simulation results), which have no counterpart in philosophy. My main point is that it seems a lot harder for technological progress to go “off the rails” because it has access to ground truths (even if that data is sparse), so we can push it much harder with ML.
I agree this could be a reason that things turn out well even if we don’t explicitly solve metaphilosophy or do something like my hybrid approach ahead of time. The way I would put it is that humans developed philosophical abilities for some mysterious reason that we don’t understand, so we can’t rule out AI developing philosophical abilities for the same reason. It feels pretty risky to rely on this though. If by the time we get human-level AI, this turns out not to be true, what are we going to do then? And even if we end up with AIs that appear to be able to help us with philosophy, how would we know, without having solved metaphilosophy, whether they are actually helping or pushing us “off the rails”?
This still seems like it is continuing the status quo (where we put more effort into technology relative to philosophy) rather than differentially benefitting technology.
Yeah, that seems right, to the extent that we want to use ML to “directly” work on technological / philosophical progress. To the extent that it has to factor through some more indirect method (e.g. through human reasoning as in iterated amplification) I think this becomes an argument to be pessimistic about solving metaphilosophy, but not that it will differentially benefit technological progress (or at least this depends on hard-to-agree-on intuitions).
I think there’s a strong argument to be made that you will have to go through some indirect method because there isn’t enough data to attack the problem directly.
(Fwiw, I’m also worried about the semi-supervised RL part of iterated amplification for the same reason.)
Yeah, I agree that this is a strong argument for your position.