I don’t speak for Matthew, but I’d like to respond to some points. My reading of his post is the same as yours, but I don’t fully agree with what you wrote as a response.
> If you find something that looks to you like a solution to outer alignment / value specification, but it doesn’t help make an AI care about human values, then you’re probably mistaken about what actual problem the term ‘value specification’ is pointing at.
>
> [...]
>
> It was always possible to attempt to solve the value specification problem by just pointing at a human. The fact that we can now also point at an LLM and get a result that’s not all that much worse than pointing at a human is not cause for an update about how hard value specification is.
My objection to this is that if an LLM can substitute for a human, it can be used to train the AI system we’re trying to align much faster and for much longer than a human could. That could make all the difference.
If you could come up with a simple action-value function Q(observation, action) that, when maximized over actions, yields good outcomes for humans, then I think that would probably be helpful for alignment.
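To make the role of such a Q concrete, here is a minimal sketch of how an action-value function induces a policy by maximizing over actions. Everything in it (the type aliases, the finite candidate-action set) is my own illustration, not something from the post:

```python
from typing import Any, Callable, Sequence

Observation = Any
Action = Any


def greedy_policy(
    q: Callable[[Observation, Action], float],
    candidate_actions: Sequence[Action],
) -> Callable[[Observation], Action]:
    """Turn an action-value function Q(observation, action) into a policy
    by choosing, in each state, the action with the highest Q-value."""
    def act(observation: Observation) -> Action:
        return max(candidate_actions, key=lambda action: q(observation, action))
    return act
```

If such a Q really did capture “good outcomes for humans”, the remaining alignment work would be getting the trained system to actually optimize it, which is the part this sketch leaves out.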
I suspect (though I could be wrong) that Q(observation, action) is basically what Matthew claims GPT-N could be. A human who gives moral counsel can only say so much, and can therefore convey only a limited amount of information to the model we’re trying to align. An LLM isn’t limited in that way and could provide a ton of information about Q(observation, action), so in practice we can treat it as our specification of Q(observation, action).
Edit: another option is that GPT-N, again because it isn’t limited by speed, could write out an explicit, very large Q(observation, action) that would actually be good, which a human couldn’t do.
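For concreteness, here is one way my reading of that claim could be cashed out: the LLM scores (observation, action) pairs directly, and we treat its scores as our Q. This is only a sketch; `query_llm` is a hypothetical stand-in for whatever interface GPT-N would actually expose, and the prompt format is invented for illustration:

```python
def query_llm(prompt: str) -> str:
    """Hypothetical placeholder for a call to GPT-N; not a real API."""
    raise NotImplementedError


def llm_q(observation: str, action: str) -> float:
    """Use the LLM's judgment as the value of Q(observation, action)."""
    prompt = (
        "On a scale from 0 (terrible for humans) to 10 (great for humans), "
        f"rate taking the action '{action}' in the situation '{observation}'. "
        "Answer with a single number."
    )
    return float(query_llm(prompt).strip())
```

The point is only that such a Q can be queried as often and as quickly as we can run the model, which is exactly what a human adviser can’t offer.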