It’s easier to construct the shape of human values than MIRI thought. An almost good enough version of that shape is within RLHFed GPT-4, in its predictive model of text.
That sounds roughly accurate, but I’d frame it more as: “It now seems easier to specify a function that reflects the human value function with high fidelity than MIRI appears to have thought.” I’m wary of the ambiguity in “construct the shape of human values,” since the point I’m making is specifically about value specification.
It still seems hard to get that shape into some AI’s values, which is something MIRI has always said.
This claim is consistent with what I wrote, but it isn’t something I actually argued. I’m uncertain about whether inner alignment is difficult, and I currently think we lack strong evidence about its difficulty.
Overall, though, I think you understood the basic points of the post.