Actually, I think the explicit content of the training data is far more important than whatever spurious artifacts may or may not hypothetically arise from training. Most AI doom scenarios of the form "the AI might be learning to like curly wire shapes, even though those shapes appear nowhere in the training data or the loss function" are exactly the type of scenario you just described: "something that technically makes a difference but in practice the marginal gain is so negligible you are wasting time to even consider it."
The "accidental taste for curly wires" is the steelman version of the paperclip maximizer, as I understand it. Eliezer doesn't actually think anybody will be stupid enough to tell an AI "make as many paper clips as possible"; he worries somebody will set up the training process in some subtly incompetent way, and the resulting AI will aggressively lie about the fact that it likes curly wires until it is released, having also learned to hide from interpretability techniques.
I definitely believe alignment research is important, and I am heartened when I see high-quality, thoughtful papers on interpretability, RLHF, etc. But then I hear Eliezer worrying about absurdly convoluted scenarios of minimal probability, and I think: wow, that is "something that technically makes a difference but in practice the marginal gain is so negligible you are wasting time to even consider it." And it's not just a waste of time: he wants to shut down the GPU clusters and cancel the greatest invention humanity ever built, all over "salt in the pasta water."
I was referring to "let's not post ideas in case an AGI later reads the post and decides to act on them." Either we build stable tool systems that are unable to act that way (see CAIS), or we are probably screwed anyway, so it doesn't matter. And even if you censor yourself, an AGI looking for harmful ideas can probably derive whatever it needs on its own.