gwern answers Why is o1 so deceptive?

gwern 28 Sep 2024 0:57 UTC
LW: 8 AF: 6
0
AF
Why do you think o1 would know that it’s making up references? The fact that it can’t retrieve URLs is completely different from, and unrelated to, its knowledge of URLs. LLMs do not need retrieval to know about many web pages. GPT-3 and GPT-4 know tons about tons of URLs! (In fact, I actively exploit this to save myself effort linking stuff on gwern.net—no retrieval necessary.)

Let’s take the Sally link. The URL may not exist, but Sally’s Baking Addiction does (unsurprisingly: there are lots of cooking-related websites, so why confabulate more than necessary?) and has many brownie recipes. Some ginger probing of 4o (to avoid any issues with o1 training and just try to understand the knowledge of the baseline) suggests that 4o finds the URL real and doesn’t discuss it being fake or confabulated: https://chatgpt.com/share/66f753df-2ac4-8006-a296-8e39a1ab3ee0
- habryka 28 Sep 2024 1:27 UTC
  LW: 10 AF: 6
  9
  AF Parent
  I agree that o1 might not be able to tell whether the link is fake, but the chain of thought does say explicitly:
  So, the assistant should [...] provide actual or plausible links.
  The “plausible” here suggests that at least in its CoT, it has realized that the task would have probably been considered completed accurately in training as long as the links are plausible, even if they are not actual links.
  - gwern 28 Sep 2024 2:17 UTC
    LW: 9 AF: 6
    −1
    AF Parent
    “Plausible” is a very ambiguous word. (Bayesianism has been defined as “a logic of plausible inference”, but hopefully that doesn’t mean Bayesians just confabulate everything.) It can mean “reasonable”, for example: “Yeah, Sally’s brownie recipe is a reasonable reference to include here, let’s go with it.” Since 4o doesn’t seem to think it’s a ‘fake’ URL in contrast to ‘actual’ URLs, “actual or plausible” is not necessarily a contrast between real and fake. (It could refer to still other things: you might not actually know there is a ‘brownies’ Wikipedia URL, having never bothered to look it up or happened to stumble across it, but without retrieving it right this second, you could surely tell me what it would be and that it would be a relevant answer, and so it would be both reasonable and plausible to include.)
    - habryka 28 Sep 2024 5:42 UTC
      LW: 17 AF: 8
      18
      AF Parent
      I agree with this in principle. But contrasting “actual” with “plausible”, combined with the fact that it talked about this in the context of not having internet access, makes me reasonably confident this is pointed at “not an actual link”, though I agree that it’s not an ironclad case.
- acertain 28 Sep 2024 1:23 UTC
  2 points
  0
  Parent
  
  o1 CoT: The user is asking for more references about brownies. <Reasoning about what the references should look like> So, the assistant should list these references clearly, with proper formatting and descriptions, and provide actual or plausible links. Remember, the model cannot retrieve actual URLs, so should format plausible ones.
  
  This might encourage it to make up links.