I read the “List of Lethalities”, think I understood it pretty well, and disagree with it in multiple places. I haven’t written those disagreements up like Paul did because I don’t expect that doing so would be particularly useful. I’ll try to explain why:
The core of my disagreement is that I think you are using a deeply mistaken framing of agency / values and how they arise in learning processes. I think I’ve found a more accurate framing, from which I’ve drawn conclusions very different to those expressed in your list, such as:
Human values are not as fragile as they introspectively appear. The felt sense of value fragility is, in large part, due to a type mismatch between the cognitive processes which form, implement, and store our values on the one hand and the cognitive processes by which we introspect on our current values on the other.
The processes by which we humans form/reflect on/generalize our values are not particularly weird among the space of processes able to form/reflect on/generalize values. Evolution pretty much grabbed the most accessible such process and minimally modified it in ways that are mostly irrelevant to alignment. E.g., I think we’re more inclined to generalize our values in ways that conform to the current social consensus, as compared to an “idealized” value forming/reflecting/generalizing process.
Relatedly, I think that “values meta-preferences” have a simple and fairly convergent core of how to do correct values reflection/generalization, in much the same way that “scientific discovery” has a simple, convergent core of how to do correct inference (i.e., Bayesianism[1]; see the sketch just after this list).
It’s possible for human and AI value systems to partially overlap to a non-trivial degree that’s robust to arbitrary capabilities gain on the part of the AI, such that a partially misaligned AI might still preserve humanity in a non-terrible state, depending on the exact degree and type of the misalignment.
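To make the analogy above concrete: the “simple core” I have in mind for inference is just Bayes’ rule, which compresses the whole update step into one line (this is the standard equation, not something specific to my frame):

$$P(H \mid D) = \frac{P(D \mid H)\, P(H)}{P(D)}$$

The claim is that value reflection/generalization has a core of comparable compactness and convergence, not that this particular equation governs it.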
The issue is that this list of disagreements relies on a framing which I’ve yet to write up properly. If you want to know whether or how much to update on my list, or how to go about disagreeing with the specifics of my beliefs, you’ll need to know the frame I’m using. Given inferential distance, properly introducing / explaining new frames is very difficult. Anyone interested can look at my current early draft for introducing the frame (though please take care not to let the current bad explanation inoculate you against a good idea).
So, my current plan is to continue working on posts that target deeper disagreements, even though there are many specific areas where I think the “List of Lethalities” is wrong.
[1] Well, the correct answer here is probably infra-Bayesianism, or possibly something even weirder. The point is that it’s information-theoretically simple and convergently useful for powerful optimizing systems.