… assuming the values you want are learnable and “convergeable” upon. “Alignment” doesn’t even necessarily have a coherent meaning.
Actual humans aren’t “aligned” with each other, and they may not be consistent enough that you can say they’re always “aligned” with themselves. Most humans’ values seem to drive them toward vaguely similar behavior in many ways… albeit with lots of very dramatic exceptions. How they articulate their values and “justify” that behavior varies even more widely than the behavior itself. Humans are frequently willing to have wars and commit various atrocities to fight against legitimately human values other than their own. Yet humans have the advantage of starting with a lot of biological commonality.
The idea that there’s some shared set of values that a machine can learn that will make everybody even largely happy seems, um, naive. Even the idea that it can learn one person’s values, or be engineered to try, seems really optimistic.
Anyway, even if the approach did work, that would just mean that “its own ideas” were that it had to learn about and implement your (or somebody’s?) values, and also that its ideas about how to do that are sound. You still have to get that right before the first time it becomes uncontrollable. One chance, no matter how you slice it.
Actual humans aren’t “aligned” with each other, and they may not be consistent enough that you can say they’re always “aligned” with themselves.
Completely agreed; see, for example, my post 3. Uploading, which makes this exact point at length.
Anyway, even if the approach did work, that would just mean that “its own ideas” were that it had to learn about and implement your (or somebody’s?) values, and also that its ideas about how to do that are sound. You still have to get that right before the first time it becomes uncontrollable. One chance, no matter how you slice it.
The point is that you now get one shot at a far simpler task: defining “your purpose as an AI is to learn about and implement the humans’ collective values” is a lot more compact, and a lot easier to get right the first time, than an accurate description of human values in their full, large-and-fairly-fragile detail. As I demonstrate in the post linked to in that quote, the former, plus its justification as being obvious and stable under reflection, can be described in exhaustive detail in a few pages of text.
As for the model’s ideas on how to do that research being sound, that’s a capabilities problem: if the model is incapable of performing a significant research project when at least 80% of the answer is already in human libraries, then it’s not much of an alignment risk.
Not if you built a model that does (or on reflection decides to do) value learning: then you instead get to be its research subject and interlocutor while it figures out its ideas. But yes, you do need to start the model off close enough to aligned that it converges to value learning.
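To make the shape of this concrete, here is a minimal toy sketch, written by the editor rather than either commenter: the fixed, compactly stated goal is just “query the human, infer their values, then act on your best current estimate”, while the values themselves are never hard-coded and the human serves only as research subject and interlocutor. All names here (CANDIDATE_VALUES, ask_human, value_learning_loop) are illustrative assumptions, not anything from the posts being discussed.

```python
import random

# Hypothetical space of candidate value systems the agent entertains.
CANDIDATE_VALUES = {
    "maximize_wellbeing": lambda outcome: outcome["wellbeing"],
    "respect_autonomy": lambda outcome: outcome["autonomy"],
    "balanced": lambda outcome: 0.5 * outcome["wellbeing"] + 0.5 * outcome["autonomy"],
}

def ask_human(option_a, option_b):
    """Stand-in for querying the human interlocutor, who (unknown to the
    agent) actually holds the 'balanced' values."""
    true_utility = CANDIDATE_VALUES["balanced"]
    return "a" if true_utility(option_a) >= true_utility(option_b) else "b"

def value_learning_loop(options, n_queries=20):
    """The compact meta-goal: ask the human about pairs of outcomes, update
    beliefs about their values, then act on the current best estimate."""
    posterior = {name: 1.0 / len(CANDIDATE_VALUES) for name in CANDIDATE_VALUES}
    for _ in range(n_queries):
        a, b = random.sample(options, 2)
        answer = ask_human(a, b)
        for name, utility in CANDIDATE_VALUES.items():
            predicted = "a" if utility(a) >= utility(b) else "b"
            # Crude likelihood update: hypotheses that predicted the human's
            # answer gain weight, the rest lose it.
            posterior[name] *= 0.9 if predicted == answer else 0.1
        total = sum(posterior.values())
        posterior = {name: weight / total for name, weight in posterior.items()}
    best_hypothesis = max(posterior, key=posterior.get)
    chosen_outcome = max(options, key=CANDIDATE_VALUES[best_hypothesis])
    return best_hypothesis, chosen_outcome

if __name__ == "__main__":
    options = [{"wellbeing": random.random(), "autonomy": random.random()}
               for _ in range(10)]
    hypothesis, outcome = value_learning_loop(options)
    print("inferred value system:", hypothesis)
    print("outcome chosen under it:", outcome)
```

The point of the sketch is only the asymmetry being claimed above: the learn-and-implement loop fits in a few lines, whereas an explicit, accurate enumeration of the human’s actual values would not.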