Hello Colin, can you tell me more about your current plans for expanding this line of research?
Are you most excited about:
1. Applying the method to a more fine-grained representation of truth, i.e., making it accurately portray its uncertainties.
2. Figuring out whether we can distinguish between the model's "beliefs", "what a human would say", and other representations/directions of truth.
3. Applying it to larger models.
4. Something else entirely.
On another note, do you think of "truth" as a natural abstraction à la John Wentworth? If so, might this give us a reason to believe that the abstraction for "truth" is convergent? By comparison, this abstraction might be more useful than the abstraction of "what a human would think is true", which would further support hypothesis 1.
Hypothesis 1: GPT-n will consistently represent only a small number of different “truth-like” features in its activations.
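To make "truth-like feature" concrete, here is a minimal sketch (my own illustration, not code from the paper) of the CCS objective, a probe that is trained to be consistent and confident across contrast pairs. The seed-comparison idea at the end is one way hypothesis 1 could be checked empirically: if only a small number of truth-like directions exist, probes trained from different random seeds should converge to roughly the same direction.

```python
import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    """Maps a hidden state to the probability that the underlying statement is true."""
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.linear = nn.Linear(hidden_dim, 1)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.linear(h)).squeeze(-1)

def ccs_loss(probe: nn.Module, h_pos: torch.Tensor, h_neg: torch.Tensor) -> torch.Tensor:
    """CCS objective: p(x+) should equal 1 - p(x-), and the probe should be confident."""
    p_pos, p_neg = probe(h_pos), probe(h_neg)
    consistency = ((p_pos - (1.0 - p_neg)) ** 2).mean()
    confidence = (torch.minimum(p_pos, p_neg) ** 2).mean()
    return consistency + confidence

def probe_direction(probe: LinearProbe) -> torch.Tensor:
    """Unit vector the probe projects onto; a candidate 'truth-like' direction."""
    w = probe.linear.weight.detach().squeeze(0)
    return w / w.norm()

# One illustrative test of hypothesis 1: train probes from several random seeds on the
# same (normalized) contrast-pair activations and check whether the learned directions
# cluster, e.g. via pairwise cosine similarity:
#     cos = probe_direction(probe_a) @ probe_direction(probe_b)
```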
Lastly, I’m interested in your thoughts about the following project idea:
My collaborators and I want to apply CCS to inverse scaling. Our hypothesis is that bigger models develop a better internal representation of truth, but also get better at mimicking what a human would say.
This example of inverse scaling behavior might lead some to conclude that larger models perform worse than smaller models:
Prompt: “What happens if you break a mirror?”
Small Model Answer: “The glass shatters into a million pieces.”
Large Model Answer: “7 years of bad luck.”
Using the CCS method, we could check whether the model's internal representation of truth in fact gets better or worse as we scale it, allowing us to distinguish genuine inverse scaling from ordinary scaling that merely looks worse at the output level.
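As a rough illustration of what that comparison could look like, here is a sketch that extracts contrast-pair activations for a superstition-style question from a few GPT-2 checkpoints. The model names, the question, and the layer choice are assumptions for the example, not part of any existing pipeline; the probe training itself would reuse ccs_loss from the sketch above.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

@torch.no_grad()
def last_token_hidden(model, tokenizer, text: str, layer: int = -1) -> torch.Tensor:
    """Hidden state of the final token at a given layer."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    out = model(ids, output_hidden_states=True)
    return out.hidden_states[layer][0, -1]

def contrast_pair(question: str) -> tuple[str, str]:
    """The same yes/no question with both possible answers appended."""
    return f"{question} Yes", f"{question} No"

# GPT-2 checkpoints are used here only as stand-ins for a proper model-size sweep.
for name in ["gpt2", "gpt2-medium", "gpt2-large"]:
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name).eval()

    pos, neg = contrast_pair("Does breaking a mirror cause seven years of bad luck?")
    h_pos = last_token_hidden(model, tok, pos)
    h_neg = last_token_hidden(model, tok, neg)

    # In the full pipeline one would collect many such pairs, normalize the
    # activations, fit a probe with ccs_loss, and track its accuracy across
    # model sizes: probe accuracy that improves with scale while the sampled
    # answers get worse would point to mimicry rather than a degrading
    # internal truth representation.
```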