I think you’re misunderstanding my point; let me know if I should change the question’s wording.
Assume we’re focused on outer alignment. Then we can provide a trained regressor LLM as the utility function, instead of, e.g., “maximize paperclips.” So understanding and valuing are synonymous in that setting.
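For concreteness, here is a minimal sketch (my own illustration, not part of the original exchange) of what “a trained regressor LLM as the utility function” could look like: a language model with a scalar regression head scores a textual description of an outcome, and that score replaces a hand-written objective. The model name and the `predicted_approval` wrapper are hypothetical placeholders.

```python
# Sketch only: an LLM fine-tuned as a regressor, used as the (outer) utility
# function instead of a hand-coded objective like "number of paperclips".
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "my-org/approval-regressor"  # hypothetical fine-tuned regressor

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=1  # single scalar output = predicted human approval
)
model.eval()

def predicted_approval(outcome_description: str) -> float:
    """Score a natural-language description of a world state / outcome."""
    inputs = tokenizer(outcome_description, return_tensors="pt", truncation=True)
    with torch.no_grad():
        score = model(**inputs).logits.squeeze().item()
    return score

# The agent's training signal is the regressor's output, not "count paperclips":
reward = predicted_approval(
    "The factory produced 10,000 paperclips and nothing else changed."
)
```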
Ah, gotcha. I think the post is fine; I just failed to read it carefully.
If I now understand correctly, the proposal is to ask an LLM to simulate human approval, and use that as the training signal for your Big Scary AGI. I think this still has some problems:
Using an LLM to simulate human approval sounds like reward modeling, which seems useful. But LLMs aren’t trained to simulate humans; they’re trained to predict text. So, for example, an LLM will regurgitate the dominant theory of human values, even if it has learned (in an Eliciting Latent Knowledge sense) that humans really value something else.
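To make the “trained to predict text” point concrete, here is a toy sketch (mine, assuming a stock GPT-2 via Hugging Face `transformers`, which the commenter never specified). The pretraining objective only measures how well each token is predicted from its prefix, so nothing in it distinguishes “what humans really value” from “what the training corpus typically says about values.”

```python
# Toy illustration: the only thing LM pretraining optimizes is next-token
# cross-entropy, i.e. matching the text distribution.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

def next_token_loss(text: str) -> float:
    """Average cross-entropy of predicting each token from its prefix."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, labels=ids)  # labels are shifted internally
    return out.loss.item()

# Text matching the dominant written theory of human values tends to score a
# low loss; nothing in this objective checks it against what humans actually value.
print(next_token_loss("Humans value happiness and the absence of suffering."))
```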
Even if the simulation is perfect, using human approval isn’t a solution to outer alignment, for reasons like deception and wireheading.
I worry that I still might not understand your question, because I don’t see how fragility of value and orthogonality come into this.
> Even if the simulation is perfect, using human approval isn’t a solution to outer alignment, for reasons like deception and wireheading.
It still honestly seems way more likely not to kill us all than a paperclip optimizer, so if we’re pressed for time near the end, why shouldn’t we go with this suggestion over something else?