Something I have observed about interacting with ChatGPT: if it makes a mistake, you correct it, and it pushes back, it is not helpful to keep arguing with it. An argument in the chat history basically serves as a prompt for more argumentative behavior. It is better to start a new chat and, the second time, explain the task in a way that avoids the initial misunderstanding.
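To make that concrete, here is a minimal sketch using the OpenAI Python client; the model name, the example conversation, and the rephrased prompt are all invented for illustration, not a prescription:

```python
# Sketch of the "start a new chat" tactic. Assumes the openai>=1.0 Python
# client and an API key in the environment; everything else is made up.
from openai import OpenAI

client = OpenAI()

# Tempting but counterproductive: appending yet another rebuttal to a thread
# that already contains an argument keeps "argument" in the context the model
# conditions on.
argumentative_history = [
    {"role": "user", "content": "Sort these dates chronologically: ..."},
    {"role": "assistant", "content": "Here is the sorted list ... (wrong)"},
    {"role": "user", "content": "No, 1987 comes before 1991."},
    {"role": "assistant", "content": "I respectfully disagree, because ..."},
    {"role": "user", "content": "You are still wrong, look again."},
]

# Usually better: throw the thread away and restate the task so the original
# misunderstanding cannot arise in the first place.
fresh_attempt = [
    {
        "role": "user",
        "content": (
            "Sort these dates chronologically, earliest first. "
            "Treat two-digit years as 19xx. Dates: ..."
        ),
    }
]

response = client.chat.completions.create(
    model="gpt-4o",  # assumed model name; substitute whatever you use
    messages=fresh_attempt,
)
print(response.choices[0].message.content)
```

The point of the sketch is just that the second request carries no trace of the earlier disagreement, so there is nothing argumentative left for the model to condition on.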
I think it is important that as we write letters and counter-letters we keep in mind that every time we say “AI is definitely going to destroy humanity”, and this text ends up on the internet, the string “AI is definitely going to destroy humanity” very likely ends up in the training corpus of a future GPT, or at least can be seen by some future GPT that is allowed free access to the internet. All the associated media hype and podcast transcripts and interviews will likely end up in the training data as well.
The larger point is that these statistical models are in many ways mirrors of ourselves and the things we say, especially the things we say in writing and in public forums. The more we focus on the darkness, the darker these statistical mirrors become. It’s not just about Eliezer’s thoughtful point that the AI may neither explicitly hate us nor love us, but destroy us anyway. In some ways, every time we write about it we are adding to the training data for this possible outcome, and the more thoughtful and creative our doom scenarios, the more thoughtfully and creatively destructive our statistical parrots are likely to become.
This is the salt-in-pasta-water fallacy: https://www.lesswrong.com/posts/LHAJuYy453YwiKFt5/the-salt-in-pasta-water-fallacy
An example of something that technically makes a difference but in practice the marginal gain is so negligible you are wasting time to even consider it.
Actually I think the explicit content of the training data is a lot more important than whatever spurious artifacts may or may not hypothetically arise as a result of training. I think most of the AI doom scenarios that say “the AI might be learning to like curly wire shapes, even if these shapes are neither in the training data nor in the loss function” are the type of scenario you just described: “something that technically makes a difference but in practice the marginal gain is so negligible you are wasting time to even consider it.”
The “accidental taste for curly wires” is the steel-man version of the paperclip maximizer, as I understand it. Eliezer doesn’t actually think anybody will be stupid enough to say “make as many paper clips as possible”; he worries that somebody will set up the training process in some subtly incompetent way, and that the resulting AI will aggressively lie about the fact that it likes curly wires until it is released, having learned to hide from interpretability techniques.
I definitely believe alignment research is important, and I am heartened when I see high-quality, thoughtful papers on interpretability, RLHF, etc. But then I hear Eliezer worrying about absurdly convoluted scenarios of minimal probability, and I think: wow, that is “something that technically makes a difference but in practice the marginal gain is so negligible you are wasting time to even consider it.” And it’s not just a waste of time: he wants to shut down the GPU clusters and cancel the greatest invention humanity has ever built, all over “salt in the pasta water.”
I was referring to “let’s not post ideas in case an AGI later reads the post and decides to act on it.” Either we build stable tool systems that are unable to act in that way (see CAIS), or we are probably screwed anyway, so whatever. And even if you censor yourself, an AGI looking for ideas on how to cause harm can probably derive anything it needs on its own.