Ah, gotcha. I think the post is fine, I just failed to read.
If I now understand correctly, the proposal is to ask an LLM to simulate human approval, and use that as the training signal for your Big Scary AGI. I think this still has some problems:
Using an LLM to simulate human approval sounds like reward modeling, which seems useful. But LLMs aren’t trained to simulate humans; they’re trained to predict text. So, for example, an LLM will regurgitate the dominant theory of human values even if it has learned (in a Latent Knowledge sense) that humans really value something else.
Even if the simulation is perfect, using human approval isn’t a solution to outer alignment, for reasons like deception and wireheading.
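To make the setup being discussed concrete, here is a minimal sketch of an LLM standing in for human approval as a reward signal. This is an illustration, not anyone’s actual implementation: `query_llm` is a hypothetical stand-in for a real model call, stubbed with a trivial keyword heuristic.

```python
def query_llm(prompt: str) -> str:
    # Hypothetical stand-in for a real LLM call. Stubbed here with a
    # trivial keyword heuristic purely for illustration.
    return "approve" if "help" in prompt else "disapprove"

def approval_reward(action_description: str) -> float:
    # Ask the simulated "human" whether it approves, and turn the
    # answer into a scalar reward for the agent being trained.
    prompt = (
        "Would a thoughtful human approve of the following action? "
        "Answer 'approve' or 'disapprove'.\n\n" + action_description
    )
    answer = query_llm(prompt).strip().lower()
    return 1.0 if answer.startswith("approve") else 0.0

# An agent trained to maximize approval_reward is optimizing "what the
# model says a human would approve of" -- which is exactly where the
# deception and wireheading worries bite.
print(approval_reward("help the user fix their code"))
print(approval_reward("delete the user's files"))
```

The gap between the reward ("predicted approval") and the intended target ("what humans actually value") is the point of the objections above.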
I worry that I still might not understand your question, because I don’t see how fragility of value and orthogonality come into this?
> Even if the simulation is perfect, using human approval isn’t a solution to outer alignment, for reasons like deception and wireheading.
It still honestly seems far more likely not to kill us all than a paperclip optimizer would, so if we’re pressed for time near the end, why shouldn’t we go with this suggestion over something else?