I am taking it as given here that powerful agents are dangerous by default, unless they happen to want exactly what you want them to want.
If you mean “powerful agents are very likely to cause catastrophe by default, unless they happen to want exactly what you want them to want,” this is where I want more rigor. You still need to define the “powerful” property that causes danger, because right now this sentence could be smuggling in assumptions, e.g. complete preferences.
Hmm, I agree the claim could be made more rigorous, but the usage here isn’t intended to claim anything that isn’t a direct consequence of the Orthogonality Thesis. I’m just saying that agents capable of exerting large effects on their world are by default going to be dangerous, in the sense that the set of all possible large effects is large compared to the set of desirable large effects.
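As a toy way of seeing the counting point (the numbers here are made up and nothing hinges on them): if a “large effect” is pinned down by some number of independent binary features of the world, and “desirable” means matching a specific target on k of those features, then the desirable slice is only 2^-k of the whole space.

```python
# Toy illustration with made-up numbers: if a "large effect" is specified by
# n independent binary features and "desirable" means matching a fixed target
# on k of them, the desirable fraction of all possible large effects is 2**-k.
n, k = 30, 20
total_effects = 2 ** n
desirable_effects = 2 ** (n - k)  # the remaining n - k features are unconstrained
print(desirable_effects / total_effects)  # 2**-20, roughly one in a million
```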
And my position is that it is often the opposing viewpoint—e.g. that danger (in an intuitive sense) depends on precise formulations of coherence or other assumptions from VNM rationality—that is smuggling in assumptions.
For example, on complete preferences, here’s a slightly more precise claim: the existence of any interesting and capable agent with incomplete preferences implies the possibility (via an often trivial construction) of a similarly-powerful agent with complete preferences, and that agent with complete preferences will often be simpler and more natural in an intuitive sense.
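Here is a rough sketch of one such construction (a toy linear-extension completion; the specific outcomes and preferences below are made up, and none of the details are load-bearing): take the incomplete strict preference relation, extend it to a total order, and read a utility function off the ranks. The completed agent agrees with the original everywhere the original has an opinion; it just also ranks the pairs the original left unranked.

```python
# Toy sketch: complete an agent's incomplete (strict, acyclic) preferences over
# finitely many outcomes by taking a linear extension of the partial order and
# assigning utilities by rank. The outcomes and preferences are made up.
from graphlib import TopologicalSorter

# strictly_prefers[a] = outcomes that a is strictly preferred to.
# Deliberately incomplete: "paperclips" vs "flourishing" is left unranked.
strictly_prefers = {
    "flourishing": {"status_quo"},
    "paperclips": {"status_quo"},
    "status_quo": {"shutdown"},
    "shutdown": set(),
}

# A linear extension never reverses an existing strict preference; feeding the
# relation to TopologicalSorter as "x comes after everything x beats" yields a
# worst-to-best ordering, so rank doubles as a utility function.
order = list(TopologicalSorter(strictly_prefers).static_order())  # worst -> best
utility = {outcome: rank for rank, outcome in enumerate(order)}
print(utility)

# Every strict preference the original agent had is preserved.
assert all(utility[a] > utility[b]
           for a, beaten in strictly_prefers.items() for b in beaten)
```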
But regardless of whether that specific claim is true or false or technically false, my larger point is that the demand for rigor here feels kind of backwards, or at least out of order. I think once you accept things which are relatively uncontroversial around here (the orthogonality thesis is true, powerful artificial minds are possible), the burden is then on the person claiming that some method for constructing a powerful mind (e.g. give it incomplete preferences) will not result in that mind being dangerous (or epsilon distance in mind-space from a mind that is dangerous) to show that that’s actually true.
I don’t think I could name a working method for constructing a safe powerful mind. What I want to say is more like: if you want to deconfuse some core AI x-risk problems, you should deconfuse your basic reasons for worry and core frames first, otherwise you’re building on air.
but the usage here isn’t intended to claim anything that isn’t a direct consequence of the Orthogonality Thesis.
I want to flag that the orthogonality thesis is incapable of supporting the assumption that powerful agents are dangerous by default, and the reason is that it only makes a possibility claim, not anything stronger than that. I think you need the assumption of 0 prior information in order to even vaguely support the hypothesis that AI is dangerous by default.
I’m just saying that agents capable of exerting large effects on their world are by default going to be dangerous, in the sense that the set of all possible large effects is large compared to the set of desirable large effects.
I think a critical disagreement is probably that I think even weak prior information shifts things toward “AI is safe by default,” and that we don’t need to specify most of our values, and can instead offload most of the complexity to the learning process.
I think once you accept things which are relatively uncontroversial around here (the orthogonality thesis is true, powerful artificial minds are possible) the burden is then on the person claiming that some method for constructing a powerful mind (e.g. give it incomplete preferences) will not result in that mind being dangerous (or epsilon distance in mind-space from a mind that is dangerous) to show that that’s actually true.
I definitely disagree with that: accepting the orthogonality thesis plus the claim that powerful minds are possible is not enough to shift the burden of proof unless something else is involved, since as stated this excludes basically nothing.
For example, on complete preferences, here’s a slightly more precise claim: the existence of any interesting and capable agent with incomplete preferences implies the possibility (via an often trivial construction) of a similarly-powerful agent with complete preferences, and that agent with complete preferences will often be simpler and more natural in an intuitive sense.
This is not the case under things like invulnerable incomplete preferences, where the authors managed to weaken the axioms of EU theory enough to get a shutdownable agent:
https://www.lesswrong.com/posts/sHGxvJrBag7nhTQvb/invulnerable-incomplete-preferences-a-formal-statement-1
I don’t see how this result contradicts my claim. If you can construct an agent with incomplete preferences that follows Dynamic Strong Maximality (DSM), you can just as easily (or more easily) construct an agent with complete preferences that doesn’t need to follow any such rule.
Also, if DSM works in practice and doesn’t impose any disadvantages on an agent following it, a powerful agent with incomplete preferences following DSM will probably still tend to get what it wants (which may not be what you want).
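To make the contrast concrete, here is a toy sketch (illustrative only; it shows the static idea of picking options that no available alternative is strictly preferred to, not the dynamic machinery the linked post adds on top): the incomplete-preferences agent needs some rule for handling unranked options, while the complete-preferences agent just maximizes a utility function.

```python
# Toy contrast (not the formal DSM rule from the linked post): an agent with
# incomplete preferences can choose "maximal" options -- options no available
# alternative is strictly preferred to -- while an agent with complete
# preferences simply argmaxes a utility function. Options are made up.
strictly_prefers = {"a": {"b"}, "b": set(), "c": set()}  # a > b; c unranked vs both

def maximal_choices(available):
    """Options that no available alternative is strictly preferred to."""
    return sorted(x for x in available
                  if not any(x in strictly_prefers[y] for y in available if y != x))

print(maximal_choices({"a", "b", "c"}))  # ['a', 'c']: both are undominated

# A "completed" agent needs no extra rule: any utility function consistent with
# the strict preferences (this one is made up) pins down its choices directly.
utility = {"a": 2, "c": 1, "b": 0}
print(max({"a", "b", "c"}, key=utility.get))  # 'a'
```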
Constructing a DSM agent seems like a promising avenue if you need the agent to have weird / anti-natural preferences, e.g. total indifference to being shut down. But IIRC, the original shutdown problem was never intended to be a complete solution to the alignment problem, or even a practical subcomponent. It was just intended to show that a particular preference that is easy to describe in words and intuitively desirable as a safety property is actually pretty difficult to write down in a way that fits into various frameworks for describing agents and their preferences in precise ways.
I don’t think I could name a working method for constructing a safe powerful mind.
I could, and my algorithm basically boils down to the following:
Specify a weak/limited prior over goal space, like the genome does.
Create a preference model by using DPO, RLHF or whatever else suits your fancy to guide the intelligence into alignment with x values.
Use the backpropagation algorithm to update the weights of the brain in the optimal direction for alignment.
Repeat until you get low loss, or until you can no longer optimize the function (a rough sketch of this loop is below).
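A rough sketch of what that loop might look like in code, with every concrete detail a placeholder of my own choosing (the list above doesn’t commit to any of them): the network initialization stands in for the weak prior over goal space, a learned preference model stands in for the DPO/RLHF step, and backpropagation on the learned reward stands in for the update step.

```python
# Rough sketch of the loop described above, with every detail made up for
# illustration: learn a preference/reward model from pairwise comparisons
# (the RLHF-style step), then use backpropagation to push a policy toward
# outputs the learned model scores highly. Real systems (DPO, RLHF with PPO,
# etc.) differ substantially; this only shows the shape of the loop.
import torch
import torch.nn as nn

obs_dim, act_dim = 8, 4
# The architectures/initializations below are the stand-in for the weak prior.
policy = nn.Sequential(nn.Linear(obs_dim, 32), nn.Tanh(), nn.Linear(32, act_dim))
reward_model = nn.Sequential(nn.Linear(obs_dim + act_dim, 32), nn.Tanh(), nn.Linear(32, 1))
opt_rm = torch.optim.Adam(reward_model.parameters(), lr=1e-3)
opt_pi = torch.optim.Adam(policy.parameters(), lr=1e-3)

def fake_human_preference(x_a, x_b):
    # Stand-in for human feedback: pretend raters prefer the option whose action
    # components sum higher. Purely a placeholder for labeled comparisons.
    return (x_a[:, obs_dim:].sum(-1) > x_b[:, obs_dim:].sum(-1)).float()

for step in range(200):
    obs = torch.randn(64, obs_dim)

    # 1. Preference-model step: Bradley-Terry-style loss on pairwise comparisons.
    pair_a = torch.cat([obs, torch.randn(64, act_dim)], dim=-1)
    pair_b = torch.cat([obs, torch.randn(64, act_dim)], dim=-1)
    label = fake_human_preference(pair_a, pair_b)
    logits = (reward_model(pair_a) - reward_model(pair_b)).squeeze(-1)
    rm_loss = nn.functional.binary_cross_entropy_with_logits(logits, label)
    opt_rm.zero_grad(); rm_loss.backward(); opt_rm.step()

    # 2. Policy step: backpropagate the (negated) learned reward into the policy.
    actions = policy(obs)
    pi_loss = -reward_model(torch.cat([obs, actions], dim=-1)).mean()
    opt_pi.zero_grad(); pi_loss.backward(); opt_pi.step()
    # 3. Repeat until the losses stop improving.
```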