Since it was evidently A Thing, I have caved to peer pressure :P
“Shard theory doesn’t need more work” (in sense 2) could be true as a matter of fact, without me knowing it’s true with high confidence. If you’re saying “for us to become highly confident that alignment is going to work this way, we need more info”, I agree.
But I read you as saying “for this to work as a matter of fact, we need X Y Z additional research”:
Yeah, this is a good point. I do indeed think that just plowing ahead wouldn’t work as a matter of fact, even if shard theory alignment is easy-in-the-way-I-think-is-plausible, and I was vague about this.
This is because the way in which I think it’s plausible for it to be easy is some case (3) that’s even more restricted than (1) or (2). Something like (3): if we could read the textbook from the future and use its ontology, maybe it would be easy / robust to build an RL agent that’s aligned because of the shard theory alignment story.
To back up: in nontrivial cases, robustness doesn’t exist in a vacuum; you have to be robust to some distribution of perturbations. For shard theory alignment to be easy, it has to be robust to the choices we have to make about building AI, and specifically to the space of different ways we might make those choices. That space depends on the ontology we’re using to think about the problem: a good ontology / way of thinking about the problem makes the right degrees of freedom “obvious,” and makes it hard to do things totally wrong.
I think in real life, if we think “maybe this doesn’t need more work and we just don’t know it yet,” what’s actually going to happen is that for some of the degrees of freedom we need to set, we’re going to be using an ontology that allows for perturbations where the thing’s not robust, depressing the chances of success exponentially.