“naturally occurring” means “could be inputs to this AI system from the rest of the world”; naturally occurring inputs don’t need to be recognized, they’re here as a base case for the induction. Does that make sense?
If there are other really powerful reasoners in the world, then they could produce value-corrupting single pages of text (and I would then worry about Soms becoming corrupted). If there aren’t, I’d guess that possible input single pages of text aren’t value-corrupting in an hour. (I would certainly want a much better answer than “I guess it’s fine” if we were really running something like this.)
To clarify my intent here: I wanted to show a possible structure of an argument that could make us confident that value drift wasn’t going to kill us. If you think it’s really unlikely that any argument of this inductive form could be run, I’d be interested in hearing why (or whether Paul or someone else thinks I’m on the wrong track or making some kind of fundamental mistake).
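The inductive structure I have in mind can be written out schematically. This is only a sketch; the labels $N$ and $f$ are my own illustrative notation, not part of the original scheme:

```latex
% N : the set of single pages of text that are non-value-corrupting,
%     i.e. an hour of Som contemplating any x in N leaves his values intact.
% f : the map from Som's input page to his output page after one hour.
\begin{align*}
&\textbf{Base case: } \forall x \text{ naturally occurring},\ x \in N.\\
&\textbf{Inductive step: } x \in N \implies f(x) \in N.\\
&\textbf{Conclusion: } \text{every page produced by iterating } f
 \text{ from naturally occurring inputs lies in } N.
\end{align*}
```

The weight of the argument then rests entirely on establishing the two premises.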
Yes, that clarifies what you were intending to say. Paul typically assumes a need to compete with other AIs with comparable resources, so I wasn’t expecting “naturally occurring” to mean coming from an environment with no other powerful reasoners.
I think if we’re in an actual non-competitive scenario (and hence can tolerate large inefficiencies in our AI design), then some sort of argument like this can possibly be made to work, but it will probably be much trickier than you seem to suggest. Here are some problems I can see aside from the ones you’ve already acknowledged.
1. We probably don’t need “powerful reasoners” to produce value-corrupting single pages of text, just a model of Som and relatively crude optimization techniques. In other words, for this system to be safe you probably need more of a tech lead/monopoly than you had in mind. Exactly how much is needed seems hard to know, so how do you achieve high confidence of safety?
2. Presumably you’re building this system to be at least as capable as a human but at less risk of value drift. In order to do work comparable to a human contemplating the input query over a few days or years, the system needs some way of transmitting information between the short-lived Soms. Given that, what is preventing the overall system from undergoing value drift because, for example, one of the Soms has an (incorrect and value-corrupting) philosophical epiphany and transmits it to other Soms through this communication channel? Nothing in your argument seems to adequately deal with this. If it happens by pure chance, you can try to wash it out by averaging over different “noise seeds”, but what if Som has a tendency toward certain incorrect lines of argument, either naturally or because you placed him in a weird environment like this one?
3. The output of this system isn’t “naturally occurring”, so subsequent inputs to it won’t be either. If we’re to use this system a second time in a way that preserves your “non-corrupting invariant”, we have to either prevent all information flow from the first output to the second input, or have another argument for why this system, along with whatever part of human civilization the information flows through, preserves “non-corrupting” as a whole. Otherwise someone could, for example, submit a bunch of queries to the system that would help them eventually craft a value-corrupting single page of text, and disguise these as innocent queries.
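The “averaging over noise seeds” remedy in point 2 can be made concrete with a toy calculation. This is a sketch under my own illustrative assumptions (the per-run error probabilities and the majority-vote aggregation are stand-ins, not part of the actual scheme): if each Som run independently stumbles into a corrupting epiphany with small probability, voting across seeds suppresses the error, but if Som has a systematic tendency toward an incorrect line of argument, the same averaging entrenches it.

```python
import math

def majority_error(p_wrong: float, k: int) -> float:
    """Probability that a majority vote over k independent Som runs settles
    on the wrong conclusion, if each run errs with probability p_wrong."""
    # Sum P(exactly j of k runs are wrong) over all j that form a majority.
    return sum(math.comb(k, j) * p_wrong**j * (1 - p_wrong) ** (k - j)
               for j in range(k // 2 + 1, k + 1))

# Independent noise: a rare random epiphany washes out as seeds are added.
print(majority_error(0.10, 1))    # 0.10
print(majority_error(0.10, 101))  # vanishingly small

# Systematic tendency: if Som leans toward the incorrect argument more
# often than not, averaging over seeds makes the error MORE likely.
print(majority_error(0.60, 1))    # 0.60
print(majority_error(0.60, 101))  # close to 1
```

So averaging only addresses the “pure chance” case; a correlated bias, whether native to Som or induced by the weird environment, passes straight through it.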
Your point 2 is an excellent summary of my reasons for being skeptical of relying on human reasoning. (I also expect more outlandish means of transmissible value-corruption would show up, on the principles that edge cases are hard to predict and that we don’t really understand our minds.)
These objections are all reasonable, and 3 is especially interesting to me—it seems like the biggest objection to the structure of the argument I gave. Thanks.
I’m afraid that the point I was trying to make didn’t come across, or that I’m not understanding how your response bears on it. Basically, I thought the post was prematurely assuming that schemes like Paul’s are not amenable to any kind of argument for confidence, and we will only ever be able to say “well, I ran out of ideas for how to break it”, so I wanted to sketch an argument structure to explain why I thought we might be able to make positive arguments for safety.
Do you think it’s unlikely that we’ll be able to make positive arguments for the safety of schemes like Paul’s? If so, I’d be really interested in why—apologies if you’ve already tried to explain this and I just haven’t figured that out.
I’m afraid that the point I was trying to make didn’t come across, or that I’m not understanding how your response bears on it.
I guess my point was that any argument for confidence will likely be subject to the kinds of problems I listed, and I don’t see a realistic plan on Paul’s (or anyone else’s) part to deal with them.
Do you think it’s unlikely that we’ll be able to make positive arguments for the safety of schemes like Paul’s?
It seems to me that in order to be able to make rigorous arguments about systems that are potentially subject to value drift, we have to understand metaphilosophy at a deep level. Without that, I don’t see how we can reason about a system that can encounter philosophical arguments, and make strong conclusions about whether it’s able to process them correctly. This seems intuitively obvious to me, but I don’t totally rule out that there is some sort of counterintuitive approach that could somehow work out.
Ah, gotcha. I’ll think about those points—I don’t have a good response. (Actually adding “think about”+(link to this discussion) to my todo list.)
It seems to me that in order to be able to make rigorous arguments about systems that are potentially subject to value drift, we have to understand metaphilosophy at a deep level.
Do you have a current best guess at an architecture that will be most amenable to us applying metaphilosophical insights to avoid value drift?
Do you have a current best guess at an architecture that will be most amenable to us applying metaphilosophical insights to avoid value drift?
Interesting question. I guess it depends on the form the metaphilosophical knowledge arrives in, and it’s currently hard to see what that form could be. I can only think of two possibilities, and neither seems highly plausible.
1. It comes as a set of instructions that humans (or emulations/models of humans) can use to safely and correctly process philosophical arguments, along with justifications for why those instructions are safe/correct. Kind of like a detailed design for meta-execution, along with theory/evidence for why it works. But natural language is fuzzy and imprecise, and humans are full of unknown security holes, so it’s hard to see how such instructions could possibly make us safe/correct, or what kind of information could make us confident of that.
2. It comes as a set of algorithms for reasoning about philosophical problems in a formal language, along with instructions/algorithms for translating natural-language philosophical problems/arguments into this formal language, and justifications for why these are all safe/correct. But this kind of result seems very far from any of our current knowledge bases, and it doesn’t seem compatible with any of the current trends in AI design (including deep learning, decision-theory-based ideas, and Paul’s kinds of designs).
So I’m not very optimistic that a metaphilosophical approach will succeed either. If it ultimately does, it will probably be through some future insights whose form I can’t foresee. (Edit: Either that, or a lot of time free from arms-race pressure to develop the necessary knowledge base and a compatible AI design for possibility 2.)