Copying my responses from the original thread.
1: 🤔 I think a challenge with testing the corrigibility of AI is that currently no AI system is capable of running autonomously. It always depends on humans deciding to host it and query it, so you can always just, e.g., pour a bucket of water on the computer running the query script to stop it. Of course, force-stopping the AI may be economically unfavorable for the human, but that's not usually considered the main issue in the context of corrigibility.
I usually find it really hard to imagine what the world will look like once economically autonomous AIs become feasible. If they become feasible, that is: while there are obviously niches today where, with better AI technology, autonomous AIs could outcompete humans, it's not obvious to me that autonomous AIs wouldn't in turn be outcompeted by centralized, human-controlled AIs. (After all, it could plausibly be more efficient to query a neural network running on a server somewhere than to carry the network along on a computer to wherever the AI is operating, and in that case you could probably economically decouple running the server from running the AI.)
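To make that parenthetical concrete, here is a minimal, purely illustrative Python sketch of the contrast; the endpoint URL, response schema, and local `model` object are hypothetical placeholders rather than any real provider's API. The point is just that in the remote case the weights, and therefore the simplest off-switch, stay with whoever runs the server.

```python
# Illustrative only: contrasts an agent that queries a centrally hosted model
# with one that carries its own weights. Names below are hypothetical placeholders.
import requests

REMOTE_ENDPOINT = "https://model-host.example/v1/generate"  # hypothetical URL


def act_via_remote_model(prompt: str) -> str:
    """The agent holds no weights; every step requires the hosted model,
    so whoever operates the server can simply stop serving requests."""
    resp = requests.post(REMOTE_ENDPOINT, json={"prompt": prompt}, timeout=30)
    resp.raise_for_status()
    return resp.json()["text"]  # hypothetical response schema


def act_via_local_model(prompt: str, model) -> str:
    """The agent carries the weights with it (`model` stands in for however
    they are loaded); stopping it means reaching its own hardware."""
    return model.generate(prompt)  # placeholder interface
```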
2: [John Wentworth’s post] has a lot of karma but not very much agreement, which is an interesting balance. I’m a contributor to this, having upvoted but not agree-voted, so I feel like I should say why I did that:
[The post] might be right! I mean, it is certainly right that [Zack] is doing a symbol/referent mixup, but you might also be right that it matters.
But you might also not be right that it matters? It seems to me that most of the value in LLMs comes from grounding the symbols in their conventional meaning, so by default I would expect them to be grounded that way, and therefore by default I would expect symbolic corrigibility to translate to actual corrigibility.
There are exceptions—sometimes I tell ChatGPT I’m doing one thing when really I’m doing something more complicated. But I’m not sure this would change a lot?
I think the way you framed the issue is excellent, crisp, and thought-provoking, but overall I don’t fully buy it.