Recently @Joseph Bloom was showing me Neuronpedia, which catalogues features found in GPT-2 by sparse autoencoders. Many of the features were semantically coherent, but I couldn’t find a word in any of the languages I speak that pointed to these concepts exactly. It felt a little bit like how human languages often have words that don’t translate, and it made us wonder whether we could learn useful abstractions about the world (e.g. ones we might actually import into English) by identifying the features being used by LLMs.
I was going to ask for interesting examples. But perhaps we can do even better and choose examples with the highest value of… uhm… something.
I am just wildly guessing here, but it seems to me that if these features are somehow implied by the human text, the ones that are “implied most strongly” could be the most interesting ones, unless they are just random artifacts of the learning process.
If we trained the LLM on the same text database but randomly rearranged the sources, or otherwise introduced some noise, would the same concepts appear?
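A rough sketch of how that could be checked, purely as an illustration: take the feature dictionaries learned by SAEs in two runs that differ only in data ordering or seed, and count how many features in one run have a near-identical direction in the other. The decoder matrices below are random stand-ins for the decoder weights you would actually get from two runs, and the 0.9 cosine-similarity threshold is an arbitrary choice.

```python
import numpy as np

def feature_overlap(decoder_a: np.ndarray, decoder_b: np.ndarray, threshold: float = 0.9) -> float:
    """Fraction of run-A features whose direction has a close match in run B.

    decoder_a, decoder_b: (n_features, d_model) SAE decoder weight matrices
    from two training runs (e.g. same corpus, shuffled source order).
    """
    # Normalize each feature direction to unit length.
    a = decoder_a / np.linalg.norm(decoder_a, axis=1, keepdims=True)
    b = decoder_b / np.linalg.norm(decoder_b, axis=1, keepdims=True)

    # Cosine similarity between every pair of features across the two runs.
    sims = a @ b.T  # shape: (n_features_a, n_features_b)

    # For each run-A feature, how similar is its best counterpart in run B?
    best_match = sims.max(axis=1)

    return float((best_match >= threshold).mean())


if __name__ == "__main__":
    # Stand-in data: in practice these would be the learned decoder weights.
    rng = np.random.default_rng(0)
    decoder_a = rng.normal(size=(1000, 768))
    # Pretend run B shares most directions with run A, plus some run-specific ones.
    decoder_b = np.vstack([
        decoder_a[:800] + 0.05 * rng.normal(size=(800, 768)),
        rng.normal(size=(200, 768)),
    ])
    print(f"overlap: {feature_overlap(decoder_a, decoder_b):.2f}")
```

If the overlap stayed high after reshuffling, that would suggest the concepts really are implied by the text; if it dropped, they would look more like artifacts of a particular run.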
So I’m not sure how well the word “invent” fits here, but I think it’s safe to say LLMs have concepts that we do not.