1. If at the time of implementing ALBA, our conceptual understanding of corrigibility is the same as it is today, how doomed would you feel?
2. How are you imagining imposing an extra constraint that our model behave corrigibly on all inputs?
3. My current best guess is that your model of how to achieve corrigibility is to train the AI on a bunch of carefully labeled examples of corrigible behavior. To what extent is this accurate?
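For concreteness, here is a minimal sketch of what question 3 is imagining: ordinary supervised learning on carefully labeled examples of corrigible vs. incorrigible behavior. Every name here (the classifier, the embedding dimension, the labels) is hypothetical and just makes the guessed approach concrete; none of it comes from ALBA or any concrete proposal.

```python
# Hypothetical illustration of "train on carefully labeled examples of
# corrigible behavior": a binary classifier over (situation, action) embeddings.
import torch
import torch.nn as nn

class CorrigibilityScorer(nn.Module):
    def __init__(self, embed_dim: int = 128):
        super().__init__()
        # Small MLP that outputs a logit for "this behavior is corrigible."
        self.net = nn.Sequential(
            nn.Linear(embed_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

def train_step(model, optimizer, batch_x, batch_y):
    """One gradient step on human-labeled (embedding, corrigible?) pairs."""
    optimizer.zero_grad()
    logits = model(batch_x).squeeze(-1)
    loss = nn.functional.binary_cross_entropy_with_logits(logits, batch_y.float())
    loss.backward()
    optimizer.step()
    return loss.item()
```

The tension with question 2 is that a model trained this way is only shaped on the labeled distribution, whereas the constraint under discussion is that it behave corrigibly on all inputs.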
If we view the US government as a single entity, it’s not clear that it would make sense to describe it as aligned with itself, under your notion of alignment. If we consider an extremely akrasiatic human, it’s not clear that it would make sense to describe him as aligned with himself. The more agenty a human is, the more it seems to make sense to describe him as being aligned with himself.
If an AI assistant has a perfect model of what its operator approves of and only acts according to that model, it seems like it should qualify as aligned. But if the operator is very akrasiatic, should this AI still qualify as being aligned with the operator?
It seems to me that clear conceptual understandings of alignment, corrigibility, and benignity depend critically on a clear conceptual understanding of agency, which suggests a few things:
Significant conceptual understanding of corrigibility is at least partially blocked on conceptual progress on HRAD. (Unless you think the relevant notions of agency can mostly be formalized with ideas outside of HRAD? Or that conceptual understandings of agency are mostly irrelevant for conceptual understandings of corrigibility?)
Unless we have strong reasons to think we can impart the relevant notions of agency via labeled training data, we shouldn’t expect to be able to adequately impart corrigibility via labeled training data.
Without a clear conceptual notion of agency, we won’t have a clear enough concept of alignment or corrigibility that we can use to make worst-case bounds.
I think a lot of folks who are confused about your claims about corrigibility share my intuitions around the nature of corrigibility / the difficulty of learning corrigibility from labeled data, and I think it would shed a lot of light if you shared more of your own views on this.
I don’t think a person can be described very precisely as having values; you need to do some work to get out something value-shaped. The easiest way is to combine a person with a deliberative process, and then make some assumption about the reflective equilibrium (e.g. that it’s rational). You will get different values depending on the choice of deliberative process: e.g. if I deliberate by writing, I will generally get somewhat different values than if I deliberate by talking to myself. This path-dependence is starkest at the beginning, and I expect it to decay towards 0. I don’t think that the difference between various forms of deliberation is likely to be too important, though prima facie it certainly could be.
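As a toy illustration of the path-dependence claim (not something from the original discussion): suppose we crudely represent values as a vector and model each deliberative process as repeatedly nudging that vector toward the same reflective equilibrium, at different rates. Under that strong assumption, the gap between what “deliberating by writing” and “deliberating by talking” produce is large after a little deliberation and shrinks toward 0 as deliberation continues.

```python
# Toy model only: "values" as a vector, each deliberative process as a
# different-speed contraction toward a shared reflective equilibrium.
import numpy as np

rng = np.random.default_rng(0)
current_values = rng.normal(size=5)   # stand-in for a person's present dispositions
equilibrium = rng.normal(size=5)      # assumed-shared reflective equilibrium

def deliberate(values, rate, steps):
    """Move `values` toward the equilibrium by `rate` per step of deliberation."""
    for _ in range(steps):
        values = values + rate * (equilibrium - values)
    return values

for steps in (1, 5, 20, 100):
    by_writing = deliberate(current_values, rate=0.3, steps=steps)
    by_talking = deliberate(current_values, rate=0.6, steps=steps)
    print(f"{steps:3d} steps: gap between processes = "
          f"{np.linalg.norm(by_writing - by_talking):.4f}")
```

The choice of a shared equilibrium is doing all the work here; if different deliberative processes converged to genuinely different endpoints, the gap would not decay to 0, which is exactly the “prima facie it certainly could be important” caveat.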
Similarly for a government, there are lots of extrapolation procedures you can use and they will generally result in different values. I think we should be skeptical of forms of value learning that look like they make sense for people but not for groups of people. (That said, groups of people seem likely to have more path-dependence, so e.g. the choice of deliberative process may be more important for groups than individuals, and more generally individuals and groups can differ in degree if not in kind.)
On this perspective, (a) a human or government is not yet the kind of thing you can be aligned with. In my definition this was hidden in the word “wants,” which was maybe bad form, but I was OK with it because most people who think about this topic already appreciate the complexity of “wants.” And (b) a human is unlikely to be aligned with anything, in the same sense that a pair of people with different values aren’t aligned with anything until they are sufficiently well-coordinated.
I don’t think that you would need to describe agency in order to build a corrigible AI. As an analogy: if you want to build an object that will be pushed in the direction of the wind, you don’t need to give the object a definition of “wind,” and you don’t even need to have a complete definition of wind yourself. It’s sufficient for the person designing/analyzing the object to know enough facts about the wind that they can design/analyze sails.