I suspect you are using the word “opaque” differently than Eliezer Yudkowsky does here. At least, I fail to see from your summary how this would contradict my interpretation of Eliezer’s statement (and your title and introduction seem to imply that it is a contradiction).
Consider the hypothetical example where GPT-3 states (incorrectly) that Geneva is the capital of Switzerland. Can we look at GPT-3’s weights and see whether it was just playing dumb or whether it genuinely thinks that Geneva is the capital of Switzerland?
If the weights/“matrices”/“giant wall of floating-point numbers” are opaque (in what I take to be Eliezer’s sense), then we would look at them and shrug our shoulders.
I fail to see from your summary how the effective theories would help in this example.
(Disclaimer: in this specific example, or similar ones, I would not be surprised if it were actually possible to figure out whether GPT-3 was playing dumb, or what caused it to play dumb. Also, I do not expect GPT-3 to actually believe that Geneva is the capital of Switzerland.)
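To make that disclaimer concrete: the crudest probe I can think of is simply asking what probability the model itself assigns to the competing answers. A minimal sketch of that, using GPT-2 via the transformers library as a stand-in (GPT-3’s weights are not public, and the prompt and candidate cities are purely illustrative):

```python
# Coarse probe: what probability does the model put on each candidate answer?
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "The capital of Switzerland is"
with torch.no_grad():
    logits = model(tokenizer(prompt, return_tensors="pt").input_ids).logits
next_token_probs = torch.softmax(logits[0, -1], dim=-1)

for city in (" Bern", " Geneva", " Zurich"):
    token_id = tokenizer(city).input_ids[0]  # first token of the city name
    print(f"p({city.strip():>7s} | prompt) = {next_token_probs[token_id].item():.4f}")
```

Of course, a high probability on “Bern” would not by itself tell us whether a model that nevertheless said “Geneva” was playing dumb; it only shows the kind of question one can at least pose to the weights.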
My guess at your meaning of “opaque” would be something like “we have no idea why deep learning works at all” or “we have no mathematical theory for the training of neural nets”, which your summary disproves.
Hmm, you may be right, sorry. I somehow read the opaqueness problem as a sub-problem of lie detection. To do lie detection, we need to formulate mathematically what lying means, and for that we need a theoretical understanding of what’s going on in a neural net in the first place, so that we have the right concepts to work with.
I think lie detection in general is very hard, although it might be tractable in specific cases. The general problem seems hard because I find it difficult to define lying mathematically. Thinking about it for five minutes, I hit several dead ends. The “best” one was this: if the agent (for lack of a better term) lies, it would not be surprised by a contrary outcome. That is, I think it would be a bad sign if the agent wasn’t surprised to find me dead tomorrow, despite having stated the contrary. And surprisal is something that we have an information-theoretic handle on.
However, even if we could design the agent such that we can feed it input that makes it actually “believe” it is tomorrow and I am dead (even though it is today and I am still alive), we would still need to distinguish surprisal about the fact that I’m dead from surprisal about the way the operator has formulated the question, or about anything else. (A clever agent might expect the operator to ask this question and deliberately forget that one can ask it in this particular way, so it would be surprised to hear this formulation, etc.) The latter issue might become more tractable now that we better understand how and why representations form, so we could potentially distinguish surprisal about form from surprisal about content. I still see this as a probable dead end because of the “make it believe” part. If a solution exists, I expect it to be specific to a particular agent architecture.
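To pin down what I mean by surprisal: just the negative log-probability the model assigns to a statement. A toy sketch (GPT-2 as a stand-in again; the sentences are illustrative), computing it for two phrasings of the same fact, already shows the form-versus-content confound:

```python
# Toy surprisal measurement: negative log-probability, in nats, of a statement.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def surprisal(text: str) -> float:
    """Total negative log-probability of the text's tokens (excluding the
    first token, which has nothing conditioning it)."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(ids).logits, dim=-1)
    # Token i is predicted by the logits at position i - 1.
    return -sum(log_probs[0, i - 1, ids[0, i]].item()
                for i in range(1, ids.shape[1]))

# Same factual content, two surface forms: the numbers will differ, so a raw
# surprisal reading cannot by itself say whether the model is "surprised" by
# the content or merely by the phrasing.
for text in ("Bern is the capital of Switzerland.",
             "The capital of Switzerland is the city of Bern."):
    print(f"{surprisal(text):6.2f} nats  |  {text}")
```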
The latter issue might become more tractable now that we better understand how and why representations form, so we could potentially distinguish surprisal about form from surprisal about content.
I would count that as substantial progress on the opaqueness problem.
To be clear: I don’t have strong confidence that this works, but I think this is something worth exploring.
Indeed, it does seem possible to figure out where simple factual information is stored in the weights of an LLM, and to distinguish whether it actually “knows” a fact or is simply parroting it.
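One family of techniques along these lines is activation patching (sometimes called causal tracing): corrupt the subject tokens of a factual prompt, then restore individual layers’ clean hidden states and see how much of the correct answer’s probability comes back. Below is a heavily simplified sketch of the idea, with GPT-2 standing in and the prompt, subject position, noise scale, and patched position all chosen purely for illustration rather than taken from any particular paper:

```python
# Simplified activation patching: corrupt the subject embedding, then restore
# one layer's clean hidden state at the final position and measure recovery.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "The capital of Switzerland is"
ids = tokenizer(prompt, return_tensors="pt").input_ids
answer_id = tokenizer(" Bern").input_ids[0]  # first token of the answer
subject_pos = [3]                            # position of " Switzerland" (illustrative)
n_layers = len(model.transformer.h)

def hidden_of(out):
    # Transformer blocks may return a tuple (hidden_states, ...) or a bare tensor.
    return out[0] if isinstance(out, tuple) else out

def answer_prob(logits):
    return torch.softmax(logits[0, -1], dim=-1)[answer_id].item()

# 1) Clean run: record every block's output hidden states.
clean_states = []
hooks = [model.transformer.h[i].register_forward_hook(
             lambda m, inp, out: clean_states.append(hidden_of(out).detach()))
         for i in range(n_layers)]
with torch.no_grad():
    clean_prob = answer_prob(model(ids).logits)
for h in hooks:
    h.remove()

# 2) Corrupted run: add fixed noise to the subject token's input embedding.
torch.manual_seed(0)
noise = 3.0 * torch.randn(len(subject_pos), model.config.n_embd)

def corrupt_embeddings(module, inp, out):
    out = out.clone()
    out[0, subject_pos] += noise
    return out

emb_hook = model.transformer.wte.register_forward_hook(corrupt_embeddings)
with torch.no_grad():
    corrupted_prob = answer_prob(model(ids).logits)
emb_hook.remove()
print(f"clean p(' Bern') = {clean_prob:.4f}, corrupted = {corrupted_prob:.4f}")

# 3) Corrupted run again, but splice one layer's clean hidden state back in
#    at the final position.
def make_patch(layer):
    def patch(module, inp, out):
        hidden = hidden_of(out).clone()
        hidden[0, -1] = clean_states[layer][0, -1]
        return (hidden,) + out[1:] if isinstance(out, tuple) else hidden
    return patch

for layer in range(n_layers):
    emb_hook = model.transformer.wte.register_forward_hook(corrupt_embeddings)
    patch_hook = model.transformer.h[layer].register_forward_hook(make_patch(layer))
    with torch.no_grad():
        patched_prob = answer_prob(model(ids).logits)
    emb_hook.remove()
    patch_hook.remove()
    print(f"layer {layer:2d}: patched p(' Bern') = {patched_prob:.4f}")
```

Layers whose restored state recovers most of the answer’s probability are candidates, on this crude measure, for where the fact is being carried at that position.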