There is an old Russian joke: an ant wants to steal two elephants. It thinks, "Let's concentrate on moving the first elephant and deal with the second later." It carefully avoids the question: "How are you going to move even one elephant?"
Your comment has the same vibes.
Like, how are you going to avoid unusual circumstances during nanotech design, which is literally the most unusual tech enterprise in history?
How are you going to create a "simulator of an ethical reasoner"? My point is that LLMs are simulators in general, and they don't stop being simulators after RLHF and instruct-tuning. You can't just pick one persona from the simulator's overall arsenal and keep it.
How do you plan to make it "supercompetent"? We don't have supercompetent ethical reasoners in the training dataset, so you can't rely on, say, similarity with human reasoning.
And I don't think the overall modular scheme is workable. Your "ethical" module would require non-trivial technical knowledge to evaluate all the proposals, even if the design modules try to explain their reasoning as simply as possible. So your plan doesn't actually differ from "train an LLM to do very non-trivial scientific research, do RLHF, hope that RLHF generalizes (it doesn't)".