Eliezer responds, then the discussion goes off the rails in the usual ways. At this point I think attempts to have text interactions between the usual suspects on this are pretty doomed to fall into these dynamics over and over again.
Really? Based on a couple of threads? On twitter? With non-zero progress?
My model says that if you train a model using current techniques, of course exactly this happens. The AI will figure out how to react in ways that cause people to evaluate it well on the test set, and it will do that. Doing that does not generalize to the underlying motivational structure you would like. It does not do what you want out of distribution.
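The failure mode claimed here can be sketched with a toy regression, far simpler than the models under discussion (my illustration, not anything from the thread): a learner can score essentially perfectly on a held-out test set drawn from the training distribution while having learned the wrong underlying rule, which only shows up out of distribution.

```python
# Toy sketch: perfect test-set performance, wrong underlying structure.
import numpy as np

rng = np.random.default_rng(0)

# True target: y = |x|. Training data only covers x >= 0, where |x| == x.
x_train = rng.uniform(0, 1, 200)
y_train = np.abs(x_train)

# Fit a line by least squares. On x >= 0 the best line is simply y = x.
slope, intercept = np.polyfit(x_train, y_train, 1)

# Held-out test set from the SAME distribution: near-zero error.
x_test = rng.uniform(0, 1, 50)
in_dist_err = np.max(np.abs((slope * x_test + intercept) - np.abs(x_test)))

# Out of distribution (x < 0): the learned rule y = x predicts -2 where
# the target is 2. Confident behavior, wrong underlying rule.
ood_err = abs((slope * -2 + intercept) - abs(-2))

print(in_dist_err < 0.01, ood_err > 3.5)  # both True
```

The evaluation procedure never probed x < 0, so nothing in training distinguished "learned y = |x|" from "learned y = x"; the analogy being claimed is that behavioral evaluations of an AI have the same blind spot.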
GPT-4 does what you want out of distribution.