I’m glad to see people critiquing Paul’s work—it seems very promising to me relative to other alignment approaches, so I put high value on finding out about problems with it. By your definition of “benign”, I don’t think humans are benign, so I’m not going to argue with that. Instead, I’ll say what I think about building aligned AIs out of simulated human judgement.
I agree with you that listing and solving problems with such systems until we can’t think of more problems is unsatisfying, and that we should have positive arguments for confidence that we won’t hit unforeseen problems; maybe at some point we need to give up on getting those arguments and do the best we can without them, but it doesn’t feel like we’re at that point yet. I’m guessing the main difference here is that I’m hopeful about producing those arguments and you think it’s not likely to work.
Here’s an example of how such an argument might go. It’s sloppy, but I think it shows the flavor that makes me hopeful.
Meta-execution preserving a “non-corrupting” invariant:
i. define a set nQ of naturally occurring queries.
ii. have some reason to think that each nq in nQ is very unlikely to cause significant value drift in Som (the simulated human) in 1 hour (such nq are “non-corrupting”).
iii. let Q be the closure of nQ under “Som spends an hour splitting q into sub-queries”.
iv. have some reason to think that Som’s processing never purposefully converts non-corrupting queries into corrupting ones.
v. have some defense against random noise producing corrupting nq or q.
vi. conclude that all q in Q are non-corrupting, and so the system won’t involve any value-drifted Soms.
This kind of system would run sort of like your (2) or Paul’s meta-execution (https://ai-alignment.com/meta-execution-27ba9b34d377).

There are some domains where this argument seems clearly true and Som isn’t just being used as a microprocessor, e.g. Go problems or conjectures to be proven. In these cases it seems like (ii), (iii), and (iv) are true by virtue of the domain—no Go problems are corrupting—and Som’s processing doesn’t contribute to the truth of (iii).
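To make the shape of the induction concrete, here is a minimal sketch. This is not Paul’s actual meta-execution: `consult_som` and `meta_execute` are hypothetical names, `consult_som` stands in for an hour of simulated-Som work, and the sketch only illustrates the structural claim that every query the system ever processes lies in Q (the top-level call is made only on members of nQ, and every recursive call is on a sub-query some Som produced from a query already in Q); it says nothing about the substantive premises (ii), (iv), and (v).

```python
# Minimal sketch of the induction in (i)-(vi). Hypothetical interfaces only:
# consult_som() stands in for "Som works on a query for at most an hour and
# either answers it or splits it into sub-queries".

from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class Answer:
    text: str


@dataclass
class Split:
    sub_queries: List[str]


def consult_som(query: str,
                sub_answers: Optional[List[Answer]] = None) -> Union[Answer, Split]:
    """One hour of simulated-Som work: answer `query` directly, or split it."""
    raise NotImplementedError("stand-in for the simulated human")


def meta_execute(query: str, depth: int = 0, max_depth: int = 10) -> Answer:
    # Invariant: `query` is in Q, the closure of nQ under "Som spends an hour
    # splitting q into sub-queries" (iii). Base case: top-level calls are made
    # only on naturally occurring nq in nQ (i, ii). Inductive step: every
    # recursive call below is on a sub-query produced by an (assumed
    # uncorrupted) Som from a query already in Q (iv).
    result = consult_som(query)
    if isinstance(result, Answer):
        return result
    if depth >= max_depth:
        return Answer("[resource limit reached]")
    sub_answers = [meta_execute(sq, depth + 1, max_depth) for sq in result.sub_queries]
    final = consult_som(query, sub_answers)
    return final if isinstance(final, Answer) else Answer("[unresolved]")
```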
For some other sets Q, it seems like (ii) will be true because of the nature of the domain (e.g. almost no naturally occurring single pages of text are value-corrupting in an hour), (iv) will be true because it would take significant work on Som’s part to convert a non-scary q into a scary q’ and because Som wouldn’t want to do this unless they were already corrupted, and (v) can be made true by using a lot of different “noise seeds” and some kind of voting system to wash out noise-produced corruption.
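A sketch of what defense (v) might look like, assuming a hypothetical `run_episode(query, seed)` that re-runs the whole system with all of its stochastic choices driven by `seed`; comparing answers as exact strings is a simplification, since real outputs would need some canonical form before voting:

```python
# Sketch of defense (v): run the same episode under many independent noise
# seeds and take a majority vote, so that corruption which depends on the
# random noise of any single run gets outvoted. run_episode() is hypothetical.

import random
from collections import Counter
from typing import Callable


def vote_over_seeds(run_episode: Callable[[str, int], str],
                    query: str, num_seeds: int = 101) -> str:
    seeds = random.SystemRandom().sample(range(10**9), k=num_seeds)
    answers = [run_episode(query, seed) for seed in seeds]
    answer, count = Counter(answers).most_common(1)[0]
    # Require a strict majority; otherwise treat the episode as suspect rather
    # than returning a possibly noise-corrupted answer.
    if count <= num_seeds // 2:
        raise RuntimeError("no majority answer; episode treated as failed")
    return answer
```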
Obviously this argument is frustratingly informal, and maybe I could become convinced that it can’t be strengthened, but I think I’d mostly be convinced by trying and failing, and it seems reasonably likely to me that we could succeed.
Paul seems to have another kind of argument for another kind of system in mind here (https://ai-alignment.com/aligned-search-366f983742e9), with a sketch of an argument at “I have a rough angle of attack in mind”. Obviously this isn’t an argument yet, but it seems worth looking into.

FWIW, Paul is thinking and writing about the kinds of problems you point out, e.g. in this post (https://ai-alignment.com/security-amplification-f4931419f903), this post (https://ai-alignment.com/reliability-amplification-a96efa115687), or this post (https://ai-alignment.com/implementing-our-considered-judgment-6c715a239b3e, search “virus” on that page). Not sure if his thoughts are helpful to you.
If you’re planning to follow up this post, I’d be most interested in whether you think it’s unlikely to be possible to design a process that we can be confident will avoid Som drift. I’d also be interested to know if there are other approaches to alignment that seem more promising to you.
>(e.g. almost no naturally occurring single pages of text are value-corrupting in an hour)
I don’t see what “naturally occurring” here could mean (even informally) that would both make this statement true and make it useful to try to design a system that could safely process “naturally occurring single pages of text”. And how would a system like this know whether a given input is “naturally occurring” and hence safe to process? Please explain?
“naturally occurring” means “could be inputs to this AI system from the rest of the world”; naturally occurring inputs don’t need to be recognized, they’re here as a base case for the induction. Does that make sense?
If there are other really powerful reasoners in the world, then they could produce value-corrupting single pages of text (and I would then worry about Soms becoming corrupted). If there aren’t, I’d guess that possible input single pages of text aren’t value-corrupting in an hour. (I would certainly want a much better answer than “I guess it’s fine” if we were really running something like this.)
To clarify my intent here, I wanted to show a possible structure of an argument that could make us confident that value drift wasn’t going to kill us. If you think it’s really unlikely that any argument of this inductive form could be run, I’d be interested in that (or if Paul or someone else thinks I’m on the wrong track / making some kind of fundamental mistake).
Yes, that clarifies what you were intending to say. Paul typically assumes a need to compete with other AIs with comparable resources, so I wasn’t expecting “naturally occurring” to mean coming from an environment with no other powerful reasoners.
I think if we’re in an actual non-competitive scenario (and hence can tolerate large inefficiencies in our AI design), then some sort of argument like this can possibly be made to work, but it will probably be much trickier than you seem to suggest. Here are some problems I can see, aside from the ones you’ve already acknowledged:

1. We probably don’t need “powerful reasoners” to produce value-corrupting single pages of text, just a model of Som and relatively crude optimization techniques. In other words, for this system to be safe you probably need more of a tech lead/monopoly than you had in mind. Exactly how much is needed seems hard to know, so how do you achieve high confidence of safety?

2. Presumably you’re building this system to be at least as capable as a human but at less risk of value drift. In order to do work comparable to a human contemplating the input query over a few days or years, the system needs some way of transmitting information between the short-lived Soms. Given that, what is preventing the overall system from undergoing value drift because, for example, one of the Soms has an (incorrect and value-corrupting) philosophical epiphany and transmits it to other Soms through this communications channel? Nothing in your argument seems to adequately deal with this. If this happens by pure chance then you can try to wash it out by averaging over different “noise seeds”, but what if Som has a tendency towards certain incorrect lines of argument, either naturally or because you placed him in a weird environment like this one? (See the toy calculation after this list.)

3. The output of this system isn’t “naturally occurring”, so subsequent inputs to it won’t be either. If we’re to use this system a second time in a way that preserves your “non-corrupting” invariant, we have to either prevent all information flow from the first output to the second input, or have another argument for why this system, along with whatever part of human civilization the information flows through, preserves “non-corrupting” as a whole. Otherwise, someone could for example submit a bunch of queries to the system that would help them eventually craft a value-corrupting single page of text, and disguise these queries as innocent queries.
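Regarding the last part of point 2, a toy calculation makes the gap between chance corruption and a shared tendency concrete (the 1% per-run corruption probability is invented purely for illustration):

```python
# Majority voting over noise seeds suppresses *independent* per-run corruption
# but does nothing against a *systematic* tendency shared by every run.

from math import comb


def p_majority_corrupted(p_single: float, n_runs: int) -> float:
    """P(more than half of n independent runs are corrupted)."""
    return sum(comb(n_runs, k) * p_single**k * (1 - p_single)**(n_runs - k)
               for k in range(n_runs // 2 + 1, n_runs + 1))


print(p_majority_corrupted(0.01, 101))  # ~1e-73: the vote washes this out
# If every Som shares the same bad tendency, all runs drift together: the
# post-vote corruption probability is just whatever that tendency produces,
# independent of the number of seeds.
```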
Your point 2 is an excellent summary of my reasons for being skeptical of relying on human reasoning. (I also expect more outlandish means of transmissible value-corruption would show up, on the principles that edge cases are hard to predict and that we don’t really understand our minds.)
These objections are all reasonable, and point 3 is especially interesting to me—it seems like the biggest objection to the structure of the argument I gave. Thanks.
I’m afraid that the point I was trying to make didn’t come across, or that I’m not understanding how your response bears on it. Basically, I thought the post was prematurely assuming that schemes like Paul’s are not amenable to any kind of argument for confidence, and we will only ever be able to say “well, I ran out of ideas for how to break it”, so I wanted to sketch an argument structure to explain why I thought we might be able to make positive arguments for safety.
Do you think it’s unlikely that we’ll be able to make positive arguments for the safety of schemes like Paul’s? If so, I’d be really interested in why—apologies if you’ve already tried to explain this and I just haven’t figured that out.
> I’m afraid that the point I was trying to make didn’t come across, or that I’m not understanding how your response bears on it.
I guess my point was that any argument for confidence will likely be subject to the kinds of problems I listed, and I don’t see a realistic plan on Paul’s (or anyone else’s) part to deal with them.
> Do you think it’s unlikely that we’ll be able to make positive arguments for the safety of schemes like Paul’s?
It seems to me that in order to be able to make rigorous arguments about systems that are potentially subject to value drift, we have to understand metaphilosophy at a deep level. Without that, I don’t see how we can reason about a system that can encounter philosophical arguments, and make strong conclusions about whether it’s able to process them correctly. This seems intuitively obvious to me, but I don’t totally rule out that there is some sort of counterintuitive approach that could somehow work out.
Ah, gotcha. I’ll think about those points—I don’t have a good response. (Actually adding “think about”+(link to this discussion) to my todo list.)
> It seems to me that in order to be able to make rigorous arguments about systems that are potentially subject to value drift, we have to understand metaphilosophy at a deep level.
Do you have a current best guess at an architecture that will be most amenable to us applying metaphilosophical insights to avoid value drift?
> Do you have a current best guess at an architecture that will be most amenable to us applying metaphilosophical insights to avoid value drift?
Interesting question. I guess it depends on the form that the metaphilosophical knowledge arrives in, but it’s currently hard to see what that could be. I can only think of two possibilities, but neither seems highly plausible.
1. It comes as a set of instructions that humans (or emulations/models of humans) can use to safely and correctly process philosophical arguments, along with justifications for why those instructions are safe/correct. Kind of like a detailed design for meta-execution along with theory/evidence for why it works. But natural language is fuzzy and imprecise, and humans are full of unknown security holes, so it’s hard to see how such instructions could possibly make us safe/correct, or what kind of information could possibly make us confident of that.
2. It comes as a set of algorithms for reasoning about philosophical problems in a formal language, along with instructions/algorithms for how to translate natural-language philosophical problems/arguments into this formal language, and justifications for why these are all safe/correct. But this kind of result seems very far from any of our current knowledge bases, and it doesn’t seem compatible with any of the current trends in AI design (including things like deep learning, decision-theory-based ideas, and Paul’s kinds of designs).
So I’m not very optimistic that a metaphilosophical approach will succeed either. If it ultimately does, it seems like maybe there will have to be some future insights whose form I can’t foresee. (Edit: Either that or a lot of time free from arms-race pressure to develop the necessary knowledge base and compatible AI design for 2.)