Shard theory, by contrast, seems aimed at a model of human values that’s both accurate and conceptually simple.
Let’s distinguish between shard theory as a model of human values, versus implementing an AI that learns its own values in a shard-based way. The former seems fine to me (pending further research on how well the model actually fits), but the latter worries me in part because it’s not reflectively stable and the proponents haven’t talked about how they plan to ensure that things will go well in the long run. If you’re talking about the former and I’m talking about the latter, then we might have been talking past each other. But I think the shard-theory proponents are proposing to do the latter (correct me if I’m wrong), so it seems important to consider that in any overall evaluation of shard theory?
BTW, here are two other reasons for my worries. Again, these may have already been addressed somewhere and I just missed them.
The AI will learn its own shard-based values, which may differ greatly from human values. Even different humans learn different values depending on genes and environment, and the AI’s “genes” and “environment” will probably lie far outside the human distribution. How do we figure out what values we want the AI to learn, and how to make sure the AI learns those values? These seem like very hard research questions.
Humans are all partly or even mostly selfish, but we don’t want the AI to be. What’s the plan here, or reason to think that shard-based agents can be trained to not be selfish?
I think of shard theory as more than just a way of modeling humans.
My main point here is that human values will be represented in AIs in a form that looks a good deal more like the shard theory model than like a utility function.
Approaches that involve utility functions seem likely to make alignment harder, by adding an extra step (translating a utility function into shard form) and/or by confusing people about how to recognize human values.
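To make the contrast I have in mind concrete, here’s a toy sketch (purely my own illustration, not a proposal for how an actual AI would or should be built; every name and number is made up): a utility-function agent applies one always-on scalar score to outcomes, whereas shards are contextually activated and each bid on actions, with behavior coming from aggregating the currently active bids.

```python
# Toy contrast between a single global utility function and contextually
# activated "shards". Purely illustrative; all names and numbers are invented.

from dataclasses import dataclass
from typing import Callable, Dict, List


# Utility-function picture: one scalar score over outcomes, always in force.
def utility(outcome: Dict[str, float]) -> float:
    return 2.0 * outcome.get("friendship", 0.0) + 1.0 * outcome.get("sugar", 0.0)


# Shard picture: each shard fires only in the contexts that formed it,
# and the agent's choice comes from aggregating the active shards' bids.
@dataclass
class Shard:
    name: str
    is_active: Callable[[Dict[str, bool]], bool]  # contextual trigger
    bid: Callable[[str], float]                   # action -> bid strength


shards: List[Shard] = [
    Shard(
        name="sugar-shard",
        is_active=lambda ctx: ctx.get("food_visible", False),
        bid=lambda action: 1.0 if action == "grab_candy" else 0.0,
    ),
    Shard(
        name="friendship-shard",
        is_active=lambda ctx: ctx.get("friend_present", False),
        bid=lambda action: 1.5 if action == "share_candy" else 0.0,
    ),
]


def choose_action(context: Dict[str, bool], actions: List[str]) -> str:
    """Pick the action with the highest total bid from currently active shards."""
    totals = {
        a: sum(s.bid(a) for s in shards if s.is_active(context)) for a in actions
    }
    return max(totals, key=totals.get)


if __name__ == "__main__":
    # With a friend present, the friendship-shard's bid dominates;
    # alone, only the sugar-shard is active and the choice flips.
    print(choose_action({"food_visible": True, "friend_present": True},
                        ["grab_candy", "share_candy"]))   # -> share_candy
    print(choose_action({"food_visible": True, "friend_present": False},
                        ["grab_candy", "share_candy"]))   # -> grab_candy
```

The point of the sketch is just that the shard-style representation is context-dependent and plural, so forcing it through a single utility function is an extra translation step.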
I’m unclear whether shard theory tells us much about how to cause AIs to have the values we want them to have.
Also, I’m not talking much about the long run. I expect that problems with reflective stability will be handled by entities that have more knowledge and intelligence than we have.
Re shard theory: I think it’s plausibly useful, and may be part of an alignment plan. But I’m quite a bit more negative than you or Turntrout on that plan, and my guess is that shard theory ultimately doesn’t impact alignment that much.