>The artificially generated data includes hallucinated links.
Not commenting on OpenAI’s training data, but commenting generally: Models don’t hallucinate because they’ve been trained on hallucinated data. They hallucinate because they’ve been trained on real data, but they can’t remember it perfectly, so they guess. I hypothesize that URLs are very commonly hallucinated because they have a common, easy-to-remember format (so the model confidently starts to write them out) but hard-to-remember details (at which point the model just guesses because it knows a guessed URL is more likely than a URL that randomly cuts off after the http://www.).
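The "easy-to-remember format, hard-to-remember details" mechanism can be sketched with a toy greedy-decoding example. All probabilities below are made up purely for illustration; the point is only that, after the prefix `http://www.`, real training text almost never just stops, so a guessed domain outscores an abrupt cutoff:

```python
# Hypothetical next-token distribution after the model has emitted "http://www.".
# The numbers are invented; the shape is the claim: domain-like continuations
# are common in training data, mid-URL stops are vanishingly rare.
next_token_probs = {
    "example": 0.08,         # a guessed, plausible-looking domain
    "github": 0.07,
    "<end-of-text>": 0.001,  # URLs that cut off after "www." barely occur in real text
}

def greedy_next(probs):
    """Pick the most likely next token, as greedy decoding would."""
    return max(probs, key=probs.get)

print(greedy_next(next_token_probs))  # a guessed domain wins over stopping
```

Under sampling rather than greedy decoding the same logic holds: the cutoff token is so unlikely that the model almost always commits to *some* domain, real or not.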
I agree, but this doesn’t explain why it would (seemingly) encourage itself to hallucinate.
These models do not have a formalized internal value system that they apply every time they produce output. This means that when values conflict, the model does not choose an answer according to some ordered hierarchy: one time it will be truthful, another time it will try to provide an answer at the cost of being merely plausible. For example, the model “knows” it is not a human and has no emotions, but for the sake of a good conversation it will say it “feels good”. For the sake of answering the user’s request, it will often give its best guess, or a plausible-sounding answer.
There is also no backward reflection: the model does not go back and check its own output.
This, of course, comes from how the model is currently trained. There is no training over the whole chain of thought (CoT) with checks for guessing or deception, so the model has no incentive to self-check and correct. Why would it start doing that out of the blue?
There is also an incentive during training to give plausible answers rather than to express self-doubt or point out the missing pieces it cannot answer.
There are two problems here:
1. These LLMs are not fully trained on human feedback (and where they are, it is likely not the highest-quality feedback). It is more that interactions with humans are used to train “teacher” model(s), which then generate artificial scenarios that the LLM is trained on. Those teacher models have no ability to check for real truthfulness, and they prefer confident, plausible answers. Even the human feedback is lacking: not every human annotator checks answers thoroughly, so some plausible-but-untrue answers slip through. If you are paid per question answered, or given a daily quota, the incentive is to be quick rather than thorough.
2. There is pressure for better performance and lower model costs (both for training and for inference). This is probably why the CoT is done in a rather bare way, without backward self-checking, and why they did not train on the full CoT. As an educated guess, training on the full CoT and making the model check parts of it against some coherent value system could cost 1.5 to 3 times more and run 1.5 to 2 times slower.
+1 here for the idea that a model must commit to a URL once it starts one, and that it can’t naturally cut off partway through. Presumably, though, the aspiration is that these reasoning/CoT-trained models could reflect back on a just-completed URL and judge whether it is likely to be real. If the model isn’t doing this check step, that looks more like a gap in its learned skills than intentional deception.
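To make the imagined check step concrete, here is a minimal sketch (not something any lab is known to run) of how a system wrapped around the model could verify an emitted URL after the fact: a cheap syntactic test first, and, where network access is available, an actual request to see whether the server answers. Both function names are hypothetical.

```python
import urllib.parse
import urllib.request

def looks_like_url(text):
    """Cheap syntactic check: does this parse as an absolute http(s) URL
    with a host made of non-empty dot-separated labels?"""
    parts = urllib.parse.urlparse(text)
    labels = parts.netloc.split(".")
    return parts.scheme in ("http", "https") and len(labels) >= 2 and all(labels)

def url_resolves(url, timeout=5):
    """Stronger check (requires network): does the server actually respond
    to a HEAD request? Catches well-formed but nonexistent URLs."""
    try:
        req = urllib.request.Request(url, method="HEAD")
        with urllib.request.urlopen(req, timeout=timeout):
            return True
    except Exception:
        return False

print(looks_like_url("http://www.example.com/docs"))  # True
print(looks_like_url("http://www."))                  # False: host is incomplete
```

Of course, the harder version of the check — the model itself reflecting on whether it actually remembers the URL, rather than an external tool probing it — is exactly what the reasoning-training discussed above does not currently incentivize.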