No, because we have tons of information about which specific kinds of information on the internet are/aren’t usually fabricated. It’s not like we have no idea at all which internet content is more/less likely to be fabricated.
Information about, say, how to prove that there are infinitely many primes is probably not usually fabricated. It’s standard basic material, there are lots of presentations of it, and it’s not the sort of thing people usually troll about. Yes, the distribution of internet text about the infinitude of primes contains more-than-zero trolling and mistakes and the like, but that’s not the typical case, so low-temperature sampling from the LLM should usually work fine for that use-case.
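To make the "low-temperature sampling" point concrete: temperature rescales the model's logits before the softmax, so a low temperature concentrates probability on the model's top choices and suppresses the rare trolling/mistake tail of the distribution. Here's a minimal sketch (the function name and logit values are illustrative, not from any particular library):

```python
import math
import random

def sample_with_temperature(logits, temperature=1.0, rng=random):
    """Sample a token index from logits after temperature scaling.

    Lower temperature sharpens the distribution toward the model's
    top choice; as temperature -> 0 this approaches greedy decoding.
    """
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Inverse-CDF sampling from the resulting distribution.
    r = rng.random()
    cumulative = 0.0
    for i, p in enumerate(probs):
        cumulative += p
        if r < cumulative:
            return i
    return len(probs) - 1
```

With a low temperature like 0.1, a token the model assigns moderately higher probability to gets sampled almost every time; so if fabricated text is a small minority of the training distribution for a topic, low-temperature samples will rarely land on it.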
On the other end of the spectrum, “fusion power plant blueprints” on the internet today will obviously be fictional and/or wrong, because nobody currently knows how to build a fusion power plant which works. This generalizes to most use-cases in which we try to get an LLM to do something (using only prompting on a base model) which nobody is currently able to do. Insofar as the LLM is able to do such things, that actually reflects suboptimal next-text prediction on its part.