The internal interpretation is not something we can specify directly, but I believe sufficient prompting could get close enough. Is that the part you disagree with?
Yup, that’s the part I disagree with.
Prompting could potentially set GPT’s internal representation of context to “A lesswrong post from 2050”; the training distribution has lesswrong posts generated over a reasonably broad time range, so it’s plausible that GPT could learn how the lesswrong-post-distribution changes over time and extrapolate that forward. What’s not plausible is the “stable, research-friendly environment” part, and more specifically the “world in which AGI is not going to take over in N years” part (assuming that AGI is in fact on track to take over our world; otherwise none of this matters anyway). The difference is that 100% of GPT’s training data is from our world; that data contains exactly zero variation which would let GPT learn what kind of writing is generated by worlds-in-which-AGI-is-not-going-to-take-over. There is no prompt which will cause it to generate writing from such a world, because there is no string such that writing in our world (and specifically in the training distribution) following that string is probably generated by a different world.
(Actually, that’s slightly too strong a claim; there does exist such a string. It would involve a program specifying a simulation of some researchers in a safe environment. But there’s no such string which we can find without separately figuring out how to simulate/predict researchers in a safe environment without using GPT.)
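To put the distributional point slightly more explicitly (a rough sketch; $p_{\text{train}}$, $p_{\text{ours}}$, and $q_{\text{safe}}$ are just my labels, not anything GPT explicitly represents): GPT approximately samples continuations from the training distribution conditional on the prompt, and the training distribution is roughly the distribution of writing produced in our world, so for any prompt string $s$

$$p_{\text{train}}(\text{continuation} \mid s) \approx p_{\text{ours}}(\text{continuation} \mid s).$$

The distribution we would want, $q_{\text{safe}}(\text{continuation})$, i.e. writing produced in a stable world where AGI does not take over, is not approximately equal to $p_{\text{ours}}(\cdot \mid s)$ for any choice of $s$, except in the degenerate case above where $s$ itself spells out a program simulating researchers in a safe environment.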