The reasons the risk is great are standard arguments, so I am a little confused about why you are asking about this.
Sorry, I meant: what are the reasons that the risk is greater than the risk from a failure of intent alignment? The question was meant to be relative to the counterfactual of work on intent alignment, since the underlying disagreement is about comparing work on intent alignment to other AI safety work. Similarly for the question about why it might take a long time to solve.
Then I don’t understand why you believe that work on anything other than intent alignment is much less urgent.
I’m claiming that intent alignment captures a large proportion of possible failure modes, which seem particularly amenable to a solution.
Imagine that a fair coin is going to be flipped 21 times, and you need to say whether there were more heads than tails. By default you see nothing, but you could try to build two machines:
1. Machine A is easy to build but not very robust; it reports the outcome of each coin flip but has a 1% chance of error for each coin flip.
2. Machine B is hard to build but very robust; it reports the outcome of each coin flip perfectly. However, you only have a 50% chance of building it by the time you need it.
In this situation, machine A is a much better plan.
(The example is meant to illustrate the phenomenon by which you might want to choose a riskier but easier-to-create option; it’s not meant to properly model intent alignment vs. other stuff on other axes.)
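To make the comparison concrete, here’s a rough sketch of the arithmetic (a toy simulation of my own; the rule “guess at random if machine B isn’t ready in time” is an extra assumption I’m adding for illustration):

```python
# Rough Monte Carlo comparison of the two plans on the 21-flip majority question.
# Added assumption: if machine B wasn't built in time, you guess at random.
import random

N_FLIPS = 21

def trial_plan_a():
    flips = [random.random() < 0.5 for _ in range(N_FLIPS)]            # true outcomes
    reports = [f if random.random() > 0.01 else not f for f in flips]  # 1% error per flip
    truth = sum(flips) > N_FLIPS // 2        # were there more heads than tails?
    answer = sum(reports) > N_FLIPS // 2
    return answer == truth

def trial_plan_b():
    if random.random() < 0.5:                # machine B built in time: perfect answer
        return True
    return random.random() < 0.5             # not built: guess at random (added assumption)

N = 100_000
print("plan A error rate:", 1 - sum(trial_plan_a() for _ in range(N)) / N)  # roughly 3-4%
print("plan B error rate:", 1 - sum(trial_plan_b() for _ in range(N)) / N)  # roughly 25%
```

So under that extra assumption, the occasional per-flip error costs you a few percentage points, while a 50% chance of having no machine at all costs you about 25.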
This is actually an important lesson about why we need theory: to construct a useful theoretical model, you don’t need to know all possible failure modes; you only need a reasonable set of assumptions.
I certainly agree with that. My motivation in choosing this example is that, empirically, we should not be able to prove that bridges are safe with respect to resonance, because in fact they are not safe and do fall when resonance occurs. (Maybe today bridge-building technology has advanced such that we are able to do such proofs, I don’t know, but at least in the past that would not have been the case.)
In this case, we either fail to prove anything, or we make unrealistic assumptions that do not hold in reality and get a proof of safety. Similarly, I think in many cases involving properties of a complex real environment, your two options are (1) don’t prove things, or (2) prove things with unrealistic assumptions that don’t hold.
But if you’re sending a spaceship to Mars (or making a superintelligent AI), trial and error is too expensive. [...] Without theory you cannot extrapolate.
I am not suggesting that we throw away all logic and make random edits to lines of code and try them out until we find a safe AI. I am simply saying that our things-that-allow-us-to-extrapolate need not be expressed in math with theorems. I don’t build mathematical theories of how to write code, and usually don’t prove my code correct; nonetheless I seem to extrapolate quite well to new coding problems.
It also sounds like you’re making a normative claim for proofs; I’m more interested in the empirical claim. (But I might be misreading you here.)
I disagree. For example, [...]
Certainly you can come up with bridging assumptions between levels of abstraction (in this case, the assumption that “human thinking for a day” is within F). I would expect that I would find some bridging assumption implausible in these settings.
I’m claiming that intent alignment captures a large proportion of possible failure modes, which seem particularly amenable to a solution.
Imagine that a fair coin is going to be flipped 21 times, and you need to say whether there were more heads than tails. By default you see nothing, but you could try to build two machines:
1. Machine A is easy to build but not very robust; it reports the outcome of each coin flip but has a 1% chance of error for each coin flip.
2. Machine B is hard to build but very robust; it reports the outcome of each coin flip perfectly. However, you only have a 50% chance of building it by the time you need it.
In this situation, machine A is a much better plan.
I am struggling to understand how this works in practice. For example, consider dialogic RL. It is a scheme intended to solve AI alignment in the strong sense. The intent-alignment thesis seems to say that I should be able to find some proper subset of the features in the scheme which is sufficient for alignment in practice. I can approximately list the set of features as:
1. Basic question-answer protocol
2. Natural language annotation
3. Quantilization of questions
4. Debate over annotations
5. Dealing with no user answer
6. Dealing with inconsistent user answers
7. Dealing with changing user beliefs
8. Dealing with changing user preferences
9. Self-reference in user beliefs
10. Quantilization of computations (to combat non-Cartesian daemons; this is not in the original proposal)
11. Reverse questions
12. Translation of counterfactuals from user frame to AI frame
13. User beliefs about computations
EDIT: 14. Confidence threshold for risky actions
Which of these features are necessary for intent-alignment and which are only necessary for strong alignment? I can’t tell.
I certainly agree with that. My motivation in choosing this example is that, empirically, we should not be able to prove that bridges are safe with respect to resonance, because in fact they are not safe and do fall when resonance occurs.
I am not an expert but I expect that bridges are constructed so that they don’t enter high-amplitude resonance in the relevant range of frequencies (which is an example of using assumptions in our models that need independent validation). We want bridges that don’t fall, don’t we?
I don’t build mathematical theories of how to write code, and usually don’t prove my code correct
On the other hand, I use mathematical models to write code for applications all the time, with some success I daresay. I guess that different experience produces different intuitions.
It also sounds like you’re making a normative claim for proofs; I’m more interested in the empirical claim.
I am making both claims to some degree. I can imagine a universe in which the empirical claim is true, and I consider it plausible (but far from certain) that we live in such a universe. But, even just understanding whether we live in such a universe requires building a mathematical theory.
Which of these features are necessary for intent-alignment and which are only necessary for strong alignment?
As far as I can tell, 2, 3, 4, and 10 are proposed implementations, not features. (E.g. the feature corresponding to 3 is “doesn’t manipulate the user” or something like that.) I’m not sure what 9, 11 and 13 are about. For the others, I’d say they’re all features that an intent-aligned AI should have; just not in literally all possible situations. But the implementation you want is something that aims for intent alignment; then because the AI is intent aligned it should have features 1, 5, 6, 7, 8. Maybe feature 12 is one I think is not covered by intent alignment, but is important to have.
I am not an expert but I expect that bridges are constructed so that they don’t enter high-amplitude resonance in the relevant range of frequencies (which is an example of using assumptions in our models that need independent validation).
This is probably true now that we know about resonance (because bridges have fallen down due to resonance); I was asking you to take the perspective where you haven’t yet seen a bridge fall down from resonance, and so you don’t think about it.
On the other hand, I use mathematical models to write code for applications all the time, with some success I daresay. I guess that different experience produces different intuitions.
Maybe I’m falling prey to the typical mind fallacy, but I really doubt that you use mathematical models to write code in the way that I mean, and I suspect you instead misunderstood what I meant.
Like, if I asked you to write code to check if an element is present in an array, do you prove theorems? I certainly expect that you have an intuitive model of how your programming language of choice works, and that model informs the code that you write, but it seems wrong to me to describe what I do, what all of my students do, and what I expect you do as using a “mathematical theory of how to write code”.
But, even just understanding whether we live in such a universe requires building a mathematical theory.
I’m curious what you think doesn’t require building a mathematical theory? It seems to me that predicting whether or not we are doomed if we don’t have a proof of safety is the sort of thing the AI safety community has done a lot of without a mathematical theory. (Like, that’s how I interpret the rocket alignment and security mindset posts.)
As far as I can tell, 2, 3, 4, and 10 are proposed implementations, not features. (E.g. the feature corresponding to 3 is “doesn’t manipulate the user” or something like that.) I’m not sure what 9, 11 and 13 are about. For the others, I’d say they’re all features that an intent-aligned AI should have; just not in literally all possible situations. But the implementation you want is something that aims for intent alignment; then because the AI is intent aligned it should have features 1, 5, 6, 7, 8. Maybe feature 12 is one I think is not covered by intent alignment, but is important to have.
Hmm. I appreciate the effort, but I don’t understand this answer. Maybe discussing this point further is not productive in this format.
I am not an expert but I expect that bridges are constructed so that they don’t enter high-amplitude resonance in the relevant range of frequencies (which is an example of using assumptions in our models that need independent validation).
This is probably true now that we know about resonance (because bridges have fallen down due to resonance); I was asking you to take the perspective where you haven’t yet seen a bridge fall down from resonance, and so you don’t think about it.
Yes, and in that perspective, the mathematical model can tell me about resonance. It’s actually incredibly easy: resonance already appears in simple harmonic oscillators. Moreover, even if I did not explicitly understand resonance, if I proved that the bridge is stable under certain assumptions about the magnitudes and spacetime spectrum of external forces, that proof automatically guarantees that resonance will not bring the bridge down (as long as the assumptions are realistic). Obviously people have not been so cautious throughout history, but that doesn’t mean we should also be careless about AGI.
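To spell out the textbook version of that point (standard physics, just an illustration): a driven, damped oscillator $m\ddot{x} + c\dot{x} + kx = F_0\cos(\omega t)$ has steady-state amplitude

$$A(\omega) = \frac{F_0/m}{\sqrt{\left(\omega_0^2 - \omega^2\right)^2 + \left(\frac{c}{m}\,\omega\right)^2}}, \qquad \omega_0 = \sqrt{k/m},$$

which blows up near $\omega \approx \omega_0$ when the damping $c$ is small. The model itself points at the danger, and at exactly which assumptions about the spectrum of the driving forces are needed to rule it out.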
I understand the argument that sometimes creating and analyzing a realistic mathematical model is difficult. I agree that under time pressure it might be better to compromise on a combination of unrealistic mathematical models, empirical data and informal reasoning. But I don’t understand why we should give up so soon. We can work towards realistic mathematical models and prepare fallbacks, and even if we don’t arrive at a realistic mathematical model, it is likely that the effort will produce valuable insights.
Maybe I’m falling prey to the typical mind fallacy, but I really doubt that you use mathematical models to write code in the way that I mean, and I suspect you instead misunderstood what I meant.
Like, if I asked you to write code to check if an element is present in an array, do you prove theorems? I certainly expect that you have an intuitive model of how your programming language of choice works, and that model informs the code that you write, but it seems wrong to me to describe what I do, what all of my students do, and what I expect you do as using a “mathematical theory of how to write code”.
First, if I am asked to check whether an element is in an array, or some other easy manipulation of data structures, I obviously don’t literally start proving a theorem with pencil and paper. However, my not-fully-formal reasoning is such that I could prove a theorem if I wanted to. My model is not exactly “intuitive”: I could explicitly explain every step. And this is exactly how all of mathematics works! Mathematicians don’t write proofs that are machine verifiable (some people do that today, but it’s a novel and tiny fraction of mathematics). They write proofs that are good enough that all the informal steps can easily be made formal by anyone with a reasonable background in the field (but actually doing that would be very labor intensive).
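For example, here is that reasoning for the array case, with the usually-implicit invariant spelled out as comments (a toy sketch, of course):

```python
# Membership check, with the informal-but-formalizable reasoning made explicit.
def contains(xs, target):
    # Invariant: at the start of each iteration, target is not among the
    # elements of xs already examined.
    for x in xs:
        if x == target:
            return True      # witness found, so target is in xs
    # Loop ended: the invariant now covers all of xs, so target is not in xs.
    return False

assert contains([1, 2, 3], 3) and not contains([1, 2, 3], 5)
```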
Second, what I actually meant is examples like the following: I use an algorithm to solve a system of linear equations, or to find the maximal matching in a graph, or to find a rotation matrix that minimizes the sum of squared distances between two point sets, because I have a proof that the algorithm works (or, in some cases, a proof that it at least produces the right answer when it converges). Moreover, this applies to problems that explicitly involve the physical world as well, such as Kalman filters or control loops.
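As a concrete sketch of the last example (my own toy code, assuming numpy): the Kabsch/SVD construction is provably the rotation minimizing the sum of squared distances, and that proof, rather than any amount of testing, is why I trust the few lines below.

```python
# Provably-optimal rotation (Kabsch): find R minimizing sum_i ||R p_i - q_i||^2.
import numpy as np

def best_rotation(P, Q):
    """P, Q: (n, 3) arrays of paired points, assumed already centered."""
    H = P.T @ Q                                   # 3x3 cross-covariance matrix
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))        # guard against reflections
    return Vt.T @ np.diag([1.0, 1.0, d]) @ U.T

# Sanity check on synthetic data: recover a known rotation.
rng = np.random.default_rng(0)
P = rng.normal(size=(10, 3)); P -= P.mean(axis=0)
theta = 0.3
R_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta),  np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
Q = P @ R_true.T                                  # q_i = R_true p_i
assert np.allclose(best_rotation(P, Q), R_true)
```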
Of course, in the latter case we need to make some assumptions about the physical world in order to prove anything. It’s true that in applications the assumptions are often false, and we merely hope that they are good enough approximations. But, when the extra effort is justified, we can do better: we can perform a mathematical analysis of how much the violation of these assumptions affects the result. Then, we can use outside knowledge to verify that the violations are within the permissible margin.
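For a standard example of what such an analysis looks like (nothing specific to alignment, just the classic linear-algebra case): if we solve $Ax = b$ but the data are slightly wrong, so that $(A + \Delta A)(x + \Delta x) = b + \Delta b$, then to first order

$$\frac{\lVert \Delta x\rVert}{\lVert x\rVert} \;\lesssim\; \kappa(A)\left(\frac{\lVert \Delta A\rVert}{\lVert A\rVert} + \frac{\lVert \Delta b\rVert}{\lVert b\rVert}\right), \qquad \kappa(A) = \lVert A\rVert\,\lVert A^{-1}\rVert,$$

so checking that the condition number and the data errors are small is exactly the “verify the violations are within the permissible margin” step.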
Third, we could also literally prove machine-verifiable theorems about the code. This is called formal verification, and people do that sometimes when the stakes are high (as they definitely are with AGI), although in this case I have no personal experience. But, this is just a “side benefit” of what I was talking about. We need the mathematical theory to know that our algorithms are safe. Formal verification “merely” tells us that the implementation doesn’t have bugs (which is something we should definitely worry about too, when it becomes relevant).
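As a toy illustration of what such a machine-verifiable statement looks like (a made-up Lean 4 example, not taken from any real verification effort, and certainly not from an AGI proposal):

```lean
-- A membership check on lists of naturals, plus statements about its
-- behavior that the proof checker actually verifies.
def contains (x : Nat) : List Nat → Bool
  | []      => false
  | y :: ys => (x == y) || contains x ys

-- `decide` evaluates each closed proposition and certifies the result.
example : contains 3 [1, 2, 3] = true  := by decide
example : contains 4 [1, 2, 3] = false := by decide
```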
I’m curious what you think doesn’t require building a mathematical theory? It seems to me that predicting whether or not we are doomed if we don’t have a proof of safety is the sort of thing the AI safety community has done a lot of without a mathematical theory. (Like, that’s how I interpret the rocket alignment and security mindset posts.)
I’m not sure about the scope of your question? I made a sandwich this morning without building mathematical theory :) I think that the AI safety community definitely produced some important arguments about AI risk, and these arguments are valid evidence. But, I consider most of the big questions to be far from settled, and I don’t see how they could be settled only with this kind of reasoning.
You made a claim a few comments above:
But ultimately, you need theoretical knowledge to know what can be safely inferred from these experiments. Without theory you cannot extrapolate.
I’m struggling to understand what you mean by “theory” here, and the programming example was trying to get at that, but not very successfully. So let’s take the sandwich example:
I made a sandwich this morning without building mathematical theory :)
Presumably the ingredients were in a slightly different configuration than you had ever seen them before, but you were still able to “extrapolate” to figure out how to make a sandwich anyway. Why didn’t you need theory for that extrapolation?
Obviously this is a silly example, but I don’t currently see any qualitative difference between sandwich-making-extrapolation and the sort of extrapolation we do when we make qualitative arguments about AI risk. Why trust the former but not the latter? One answer is that the latter is more complex, but you seem to be arguing something else.
I decided that the answer deserves its own post.