I’m well aware of, and agree with, the point that there is a fundamental difference between knowing what we want and being motivated to do what we want. But as I wrote in the first paragraph:
Already LaMDA or InstructGPT (language models fine-tuned with supervised learning to follow instructions, essentially ChatGPT without any RLHF applied), are in fact pretty safe Oracles in regard to fulfilling wishes without misinterpreting you, and an Oracle AI is just a special kind of Genie whose actions are restricted to outputting text. If you tell InstructGPT what you want, it will very much try to give you just what you want, not something unintended, at least if it can be produced using text.
That is, instruction-tuned language models do not just understand (epistemically) what we want them to do; to a large extent, they additionally do what we want them to do. They are good at executing our instructions, not merely at understanding them and then doing something unintended anyway.
(However, I agree they are probably not perfect at executing our instructions as we intended them. We might ask them to answer to the best of their knowledge, and they may instead answer with something that “sounds good” but is not what they in fact believe. Or, as Gwern pointed out, they exhibit oddities like a strange tendency to answer a request for a non-rhyming poem with a rhyming poem, even though they may be well aware, internally, that this isn’t what was requested.)