I kind of want to comment on this but am finding it hard to do so, so I’ll at least leave a comment expressing my frustration.
This post falls into some kind of uncanny valley: it feels wrong, but there's both too much and too little detail to criticize it directly. There's a lot of wiggle room, with things underdefined in ways that make it hard to pin down whether the proposal is reasonable. It pattern-matches to many things in the category of "hey, I just heard about alignment and I thought about it for a while and I think I see how to solve it", but it avoids the most egregious errors of that category, which is why it's hard to say much about.
So I come away thinking I have no reason to believe this will work, but also unable to say anything specific about why it won't, other than that there are a bunch of hidden details here that are not being adequately explored.
I would reckon: no single AI safety method “will work” because no single method is enough by itself. The idea expressed in the post would not “solve” AI alignment, but I think it’s a thought-provoking angle on part of the problem.