Here is a clarification of my previous comment, which I believe was based on a misunderstanding:
I don’t like the definition “an AGI is aligned if running it leads to good long-term outcomes” as a way of carving out a set of research problems or a research goal, because “AI alignment” then includes basically all x-risk-relevant research. For example, it would include understanding physics relevant to possible high-energy physics catastrophes, and then making sure we give that information to our AGI so that it doesn’t inadvertently cause such a catastrophe.
When I use “AI alignment,” I don’t want to include differential progress in fundamental physics that could help avoid catastrophes.
Your definition in the parent requires good behavior only in the limit of infinite computation, which I assumed was a way to make these other problems easy, and thereby exclude them from the definition. For example, if we have infinite computation, our AI can do exhaustive Bayesian inference over possible theories of physics in order to make optimal decisions, so progress in physics wouldn’t be relevant to AI alignment.
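(To make that concrete, here is a toy sketch of the exhaustive procedure I have in mind; every name below is hypothetical, and a finite hypothesis class stands in for the infinite space of physical theories:)

```python
# Toy sketch (hypothetical, not anyone's actual proposal): with unbounded
# computation, an agent could enumerate every candidate theory of physics,
# weight each theory by prior * likelihood of the observed data, and act to
# maximize expected utility; externally supplied physics results would then
# add nothing it couldn't derive itself.

def bayes_optimal_action(hypotheses, prior, likelihood, utility, observations, actions):
    """Exhaustive Bayesian decision-making over a (finite stand-in for an
    infinite) class of physical theories."""
    # Unnormalized posterior weight of each theory given the data.
    weights = {h: prior[h] * likelihood(observations, h) for h in hypotheses}
    total = sum(weights.values())
    posterior = {h: w / total for h, w in weights.items()}

    # Expected utility of an action, averaged over theories.
    def expected_utility(action):
        return sum(posterior[h] * utility(action, h) for h in hypotheses)

    return max(actions, key=expected_utility)
```

A theory on which some action triggers a high-energy physics catastrophe would simply drag that action’s expected utility down, with no human physics input required.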
But I don’t think this trick works for separating out AI alignment problems in particular, because giving your AI infinite computation (while not giving competitors infinite computation) also eliminates most of the difficulties that we do want to think of as AI alignment.
Here is what I now believe you are/were saying:
Let’s define “aligned” to mean something like “would yield good outcomes if run with infinite computation.” Then we can describe our research goal in terms of “alignment” as something like “We want a version of technique X that has the same advantages as X but produces aligned agents.”
I don’t think this is helpful either, because this “alignment” definition only tells us something about the behavior of our agent when we run it with infinite computation, and nothing about what happens when we run it in the real world. For example, the kinds of difficulties we were discussing in this thread (which come from the AI’s imperfect models of humans) would be irrelevant according to this definition of alignment, but seem extremely important in practice.
Saying what “aligned” means in the limit of infinite computation may be a useful step towards giving a definition in the realistic case of finite computation (though I don’t see how to make progress along those lines). I would be inclined to give that concept some name like “asymptotically aligned” and then use “aligned” interchangeably with “actually aligned, as implemented in the real world.”
I also think defining asymptotic alignment is non-trivial. I’d try something like: “when run with infinite computation and perfect information about the operator, including the operator’s knowledge about the world, the system outputs optimal decisions according to the operator’s {preferences}”, where {preferences} is a stand-in for some as-yet-undefined concept that includes the operator’s enlightened preferences, beliefs, decision theory, etc.
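To gesture at what a formalization might look like (purely a sketch, not a settled proposal): write $K$ for the operator’s knowledge, $\mathcal{A}$ for the available decisions, and $U^*$ for a stand-in utility function encoding the operator’s {preferences}. Then a system with policy $\pi$ is asymptotically aligned if, in the limit of unbounded computation,

$$\pi(K) \in \arg\max_{a \in \mathcal{A}} \mathbb{E}\left[\, U^*(\mathrm{outcome}) \mid a, K \,\right].$$

All of the real difficulty is hidden in $U^*$ (and in how the expectation should be taken if the operator’s enlightened decision theory isn’t expected-utility maximization).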
Let me know if I am still misunderstanding you.
As a meta note: My current take is that more precise definitions are useful, and that I should adjust any behavior that is causing easily-corrected misunderstanding, but that coming up with more precise definitions is lower priority than making progress on the problem (and will be easier after making progress). It might be more useful to anchor this discussion to some particular significant problems arising from our definitional unclarity, if you think that it’s an important enough issue to be worth spending time on.
>My current take is that more precise definitions are useful, and that I should adjust any behavior that is causing easily-corrected misunderstanding, but that coming up with more precise definitions is lower priority than making progress on the problem (and will be easier after making progress).
(In addition to the other reasons I gave for prioritizing clarity of definitions/explanations) I’d like to help make forward progress on these things (despite not being as optimistic as you), but it’s hard to do that without first understanding your existing ideas and intuitions, and that’s hard to do while being confused about what your words mean. I think this probably also applies to others who would like to contribute to this research.
>For example, the kinds of difficulties we were discussing in this thread (which come from the AI’s imperfect models of humans) would be irrelevant according to this definition of alignment, but seem extremely important in practice.
In my comment that started this sub-thread, I asked “Do you consider this [your mindcrime example] a violation of alignment?” You didn’t give a direct yes or no answer, but I thought it was clear from what you wrote that the answer is “no” (and therefore you consider these kinds of difficulties to be irrelevant according to your own definition of alignment), which is why I proposed the particular formalization that I did. I thought you were saying that these kinds of difficulties are not relevant to “alignment” but are relevant to “safety”. Did I misunderstand your answer, or perhaps you misunderstood my question, or something else?
I don’t think {not noticing that mindcrime is a problem} is a violation of alignment: the AI is trying to do what you want but makes a moral error.
I do think {if the AI is too weak, it secretly plots to kill everyone} is a violation of alignment: the AI isn’t trying to do what you want. It knows that you don’t want it to kill everyone; that’s why it’s trying to keep it secret.
(It’s technically possible for an AI to kill everyone, or even to secretly kill everyone, while trying to do what you want, simply because it makes a mistake. This seems like an inevitable feature of any sensible definition of alignment. I expect to now have an involved discussion about what the difference is.)
(Wei Dai and I discussed my definition of alignment offline, leading to this post which hopefully clarifies things a little bit in addition to summarizing the takeaways from this thread.)