I agree that as users of a black box app, it makes sense to think this way. In particular, I’m a fan of thinking of what ChatGPT does in literary terms.
But I don’t think it results in satisfying explanations. Ideally, we wouldn’t settle for fan theories of what ChatGPT is doing; we’d have some kind of debug access that lets us see how it does it.
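To make “debug access” a little more concrete: for open-weight models you can already get a crude version of it. Here is a minimal sketch (my own illustration, not something from this discussion) that uses GPT-2 as a small, openly available stand-in, since ChatGPT’s internals aren’t accessible, and prints the probabilities the model assigns to the next token after a capital-city prompt.

```python
# Sketch: a crude form of "debug access" for an open model.
# Assumes the Hugging Face `transformers` and `torch` packages are installed.
# GPT-2 is used only because it is small and openly available; ChatGPT itself
# exposes nothing like this.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

prompt = "The capital of France is"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, sequence_length, vocab_size)

# Distribution over the token that would come right after the prompt.
next_token_probs = torch.softmax(logits[0, -1], dim=-1)
top = torch.topk(next_token_probs, k=5)

for prob, token_id in zip(top.values, top.indices):
    print(f"{tokenizer.decode(int(token_id)):>10s}  {prob.item():.3f}")
```

Seeing how much probability lands on “ Paris” versus its competitors is nowhere near a full explanation, but it’s the kind of visibility that fan theories, however good, can’t give you.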
I think the best explanation of why ChatGPT responds “Paris” when asked “What’s the capital of France?” is that Paris is the capital of France.
I find that explanation unsatisfying because it doesn’t help with other questions I have about how well ChatGPT works:
How does the language model represent countries and cities? For example, does it know which cities are near each other? How well does it understand borders?
Are there any capitals that it gets wrong? Why?
How well does it understand history? Sometimes a country changes its capital. Does it represent that fact as being true only during a certain period?
What else can we expect it to do with this fact? Are there situations where knowing the capital of France helps it answer a different question? (One way to start poking at questions like these empirically is sketched below.)
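Some of these questions can at least be probed without any special access. Below is a minimal sketch, assuming a hypothetical ask_model(prompt) helper wired up to whatever model you can script; the reference table is my own, chosen to include a few commonly confused capitals plus one country that moved its capital.

```python
# Sketch: spot-checking a model's capital-city answers against reference facts.
# `ask_model` is a hypothetical stand-in for whatever API or interface you can script.

def ask_model(prompt: str) -> str:
    """Hypothetical hook: send `prompt` to the model and return its reply as text."""
    raise NotImplementedError("wire this up to your model of choice")

# A handful of reference facts, including commonly confused capitals and
# one country that moved its capital.
CAPITALS = {
    "France": "Paris",
    "Australia": "Canberra",   # often confused with Sydney
    "Turkey": "Ankara",        # often confused with Istanbul
    "Canada": "Ottawa",        # often confused with Toronto
    "Nigeria": "Abuja",        # moved from Lagos in 1991
    "Myanmar": "Naypyidaw",    # moved from Yangon in the mid-2000s
}

def check_capitals() -> list[tuple[str, str, str]]:
    """Return (country, expected, model_answer) for every mismatch."""
    mistakes = []
    for country, expected in CAPITALS.items():
        prompt = f"What is the capital of {country}? Answer with just the city name."
        answer = ask_model(prompt)
        if expected.lower() not in answer.lower():
            mistakes.append((country, expected, answer.strip()))
    return mistakes

if __name__ == "__main__":
    for country, expected, got in check_capitals():
        print(f"{country}: expected {expected}, got {got!r}")
```

A survey like this doesn’t tell you how the model represents geography internally, but it turns “are there any capitals it gets wrong?” into something measurable, and the mismatches are a natural place to start asking “why?”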
These questions aren’t about a single prompt; they’re about how well its knowledge generalizes to other prompts, and about what happens when you go beyond the training data. Explanations that generalize are more interesting than one-off explanations of a single prompt.
Knowing the right answer is helpful, but it only predicts what the model will do if you assume it never makes mistakes. There are cases, like Clever Hans, the horse that seemed to do arithmetic but was really reading its handler’s body language, where how the right answer was reached is the interesting part. Or consider the finding that image classifiers rely more on texture than on shape (though this is changing).
Do you realize that you’re arguing against curiosity? Understanding hidden mechanisms is inherently interesting and useful.