I’ll have to eat the downvote for now—I think it’s worth it to use magic as a term of art, since it’s 11 fewer words than “stuff we need to remind ourselves we don’t know how to do,” and I’m not satisfied with “free parameters.”
I think it’s quite plausible that you don’t need much more work for shard theory alignment, because value formation really is that easy / robust.
But how do we learn that fact?
If extremely-confident-you says “the diamond-alignment post would literally work” and I say “what about these magical steps where you make choices without knowing how to build confidence in them beforehand” and extremely-confident-you says “don’t worry, most choices work fine because value formation is robust,” how did they learn that value formation is robust in that sense?
I think it is unlikely but plausible that shard theory alignment could turn out to be easy, if only we had the textbook from the future. But I don’t think it’s plausible that getting that textbook is easy. Yes, we have arguments about human values that are suggestive, but I don’t see a way to go from “suggestive” to “I am actually confident” that doesn’t involve de-mystifying the magic.
I think it’s worth it to use magic as a term of art, since it’s 11 fewer words than “stuff we need to remind ourselves we don’t know how to do,” and I’m not satisfied with “free parameters.”
11 fewer words, but I don’t think it communicates the intended concept!
If you have to say “I don’t mean one obvious reading of the title” as the first sentence, it’s probably not a good title. This isn’t a dig—titling posts is hard, and I think it’s fair to not be satisfied with the one I gave. I asked ChatGPT to generate several new titles; lightly edited:
“Uncertainties left open by Shard Theory”
“Limitations of Current Shard Theory”
“Challenges in Applying Shard Theory”
“Unanswered Questions of Shard Theory”
“Exploring the Unknowns of Shard Theory”
After considering these, I think that “Reminder: shard theory leaves open important uncertainties” is better than these five, and far better than the current title. I think a better title is quite within reach.
But how do we learn that fact?
I didn’t claim that I assign high credence to alignment just working out; I’m saying that it may as a matter of fact turn out that shard theory doesn’t “need a lot more work,” because alignment works out as a matter of fact from the obvious setups people try.
There’s a degenerate version of this claim, where ST doesn’t need more work because alignment is “just easy” for non-shard-theory reasons, and in that world ST “doesn’t need more work” because alignment itself doesn’t need more work.
There’s a less degenerate version of the claim, where alignment is easy for shard-theory reasons—e.g. agents robustly pick up a lot of values, many of which involve caring about us.
“Shard theory doesn’t need more work” (in sense 2) could be true as a matter of fact, without me knowing it’s true with high confidence. If you’re saying “for us to become highly confident that alignment is going to work this way, we need more info”, I agree.
But I read you as saying “for this to work as a matter of fact, we need X Y Z additional research”:
At best we need more abstract thought about this issue in order to figure out what an approach might even look like, and at worst I think this is a problem that necessitates a different approach.
And I think this is wrong. Sense 2 can just be true without our justifiably knowing it. So I usually say “It is not known to me that I know how to solve alignment”, and not “I don’t know how to solve alignment.”
Since it was evidently A Thing, I have caved to peer pressure :P
“Shard theory doesn’t need more work” (in sense 2) could be true as a matter of fact, without me knowing it’s true with high confidence. If you’re saying “for us to become highly confident that alignment is going to work this way, we need more info”, I agree.
But I read you as saying “for this to work as a matter of fact, we need X Y Z additional research”:
Yeah, this is a good point. I do indeed think that just plowing ahead wouldn’t work as a matter of fact, even if shard theory alignment is easy-in-the-way-I-think-is-plausible, and I was vague about this.
This is because the way in which I think it’s plausible for it to be easy is some case (3) that’s even more restricted than (1) or (2). Something like (3): if we could read the textbook from the future and use its ontology, maybe it would be easy / robust to build an RL agent that’s aligned because of the shard theory alignment story.
To back up: in nontrivial cases, robustness doesn’t exist in a vacuum; you have to be robust to some distribution of perturbations. For shard theory alignment to be easy, it has to be robust to the choices we have to make about building AI, and specifically to the space of different ways we might make those choices. This space of different ways we could make choices depends on the ontology we’re using to think about the problem: a good ontology / way of thinking about the problem makes the right degrees of freedom “obvious,” and makes it hard to do things totally wrong.
I think in real life, if we think “maybe this doesn’t need more work and we just don’t know it yet,” what’s actually going to happen is that for some of the degrees of freedom we need to set, we’re going to be using an ontology that allows for perturbations where the thing’s not robust, depressing the chances of success exponentially.
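To make “exponentially” concrete, here is a toy sketch rather than a model of any real training setup. It assumes, purely for illustration, that each non-robust degree of freedom is set correctly with the same probability and independently of the others, and that every one of them has to come out right. The function name and the numbers are hypothetical.

```python
def joint_success(p_per_choice: float, n_choices: int) -> float:
    """Chance that all n_choices land in the robust region.

    Toy assumption: each choice succeeds independently with
    probability p_per_choice, and all of them must succeed.
    """
    if not 0.0 <= p_per_choice <= 1.0:
        raise ValueError("p_per_choice must be a probability in [0, 1]")
    if n_choices < 0:
        raise ValueError("n_choices must be non-negative")
    return p_per_choice ** n_choices


if __name__ == "__main__":
    # Hypothetical numbers: a 90% chance of getting each choice right.
    for n in (1, 5, 10, 20):
        print(f"{n:>2} non-robust choices -> {joint_success(0.9, n):.3f}")
```

Under these assumptions the overall chance falls off geometrically in the number of non-robust choices, which is the sense of “exponentially” here. If the choices were correlated, or if only some of them had to come out right, the number would look different.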
Wouldn’t “Shard theory requires work” or “Shard theory requires novel insights” work?
Perhaps just [Shard theory alignment requires “magic”] to indicate that the word is used in a different way?
Does that make sense?