It’s #1, with a light side order of #3 that doesn’t matter because #1.
I’m not sure where to start on explaining. How would you state a theorem that an AGI would put two cellular-identical strawberries on a plate, including inventing and building all technology required to do that, without destroying the world? If you can state this theorem, you’ve done 250% of the work required to align an AGI.
Thanks, this helps. But I think what I was imagining wouldn’t be enough to let you put two cellular-identical strawberries on a plate without destroying the world? Rather, it would let you definitely put two cellular-identical strawberries on a plate, almost certainly while destroying the world.
My understanding is that, right now, we couldn’t design a paperclip maximizer if we tried; we’d just end up with something that is to paperclip maximization what we are to inclusive genetic fitness. (Is that even right?) That’s the problem that struck me as maybe-possibly amenable to proof search.
So, proving the theorem would give you a scheme that can do what evolution and gradient descent can’t (and faster than argmax). And then if you told it to make strawberries, it’d do that while destroying the world; if you told it to make strawberries without destroying the world, it’d do that too, but that would be a lot harder to express since value is fragile. So what I was imagining wouldn’t be enough to stop everyone from dying, but it would make progress on alignment.
(As for how I’d do it, I don’t know, I think I don’t understand what “optimization” even is, really 🙁)
Hopefully it makes sense what I’m asking — still the case that my intuition about it maybe-possibly being amenable to proof search is just wrong, y/n?
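(Editorial aside, not part of the exchange above: a minimal toy sketch of the contrast drawn in the “faster than argmax” message. Exhaustive argmax over every possible plan really does maximize the stated objective, but the search blows up combinatorially; a cheap myopic optimizer, standing in only very loosely for what evolution or gradient descent tend to produce, ends up pursuing something that falls short of the stated objective. The paperclip toy objective, action names, and horizon are all invented for illustration.)

```python
# Toy illustration (assumed/invented, not from the conversation):
# "argmax over all plans" is exact but exponential; a myopic optimizer is cheap
# but doesn't actually pursue the stated objective.
from itertools import product

ACTIONS = ("make_clip", "gather_wire", "idle")

def paperclips(plan):
    """Stated objective: paperclips produced. Making a clip consumes one wire."""
    wire, clips = 1, 0
    for action in plan:
        if action == "gather_wire":
            wire += 1
        elif action == "make_clip" and wire > 0:
            wire -= 1
            clips += 1
    return clips

def argmax_plan(horizon):
    """True argmax: score every one of len(ACTIONS)**horizon candidate plans."""
    return max(product(ACTIONS, repeat=horizon), key=paperclips)

def greedy_plan(horizon):
    """Myopic optimizer: one-step lookahead, so gathering wire never looks
    better than anything else and the plan stalls after the first clip."""
    plan = []
    for _ in range(horizon):
        plan.append(max(ACTIONS, key=lambda a: paperclips(plan + [a])))
    return plan

if __name__ == "__main__":
    h = 6
    best, myopic = argmax_plan(h), greedy_plan(h)
    print("argmax:", best, "->", paperclips(best))      # 3 clips, after 3**6 = 729 evaluations
    print("greedy:", myopic, "->", paperclips(myopic))  # stuck at 1 clip
```

With horizon 6, the exhaustive search finds a 3-clip plan after scoring 729 candidates, while the myopic policy never sees a one-step gain from gathering wire and stops at 1 clip; the sketch only illustrates the exact-but-intractable vs. cheap-but-misaimed trade-off, not anything about actual training dynamics.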