More than anything else, it sounds like the RLHF from OpenAI responding to our culture’s general misogyny.
RLHF is not necessary to see these behaviors: the original post is not (only) about RLHFed models, and mere predictive models of text are enough, since that's what was studied here. One has to be quite careful to analyze the results of models like this strictly in terms of the causal process that generated the model. I'm a fan of epistemically careful psychoanalysis, but it's mighty rare; tools like this make highly careful psychoanalysis actually possible, as a form of large-scale mechinterp like the original post.

And don't lose track of the fact that AIs will have weird representational differences arising from differences in what's natural for brains (3D, asynchronously spiking, proteins-and-chemicals, complex neurons in a highly local recurrent brain, trained with simple local cost functions and evolution-pretrained context-dependent reinforcement-learning responses, which let an organism generate its own experiences through exploration) versus what's natural for current AIs (simple rectified-linear floating-point neurons in a synchronous self-attention network, trained by global backprop gradient descent on a fixed dataset).

There's a lot of similarity between humans and current AIs, but also a lot of difference; I wouldn't assume that all people have the same stuff in the space between meanings as these models do, though I do imagine it's reasonably common.
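For concreteness, here is a minimal sketch (my own illustration, not anything from the original post) of the second half of that contrast: a single transformer-style block with rectified-linear neurons, updating every position of a fixed input synchronously. All names, dimensions, and the random weights are arbitrary stand-ins for parameters that backprop training would actually set.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16          # model width, arbitrary for illustration
seq_len = 8     # all positions are processed at once from a fixed input

# Random weights stand in for parameters that global backprop gradient descent
# on a fixed text dataset would actually determine.
W_q, W_k, W_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
W_up = rng.standard_normal((d, 4 * d)) / np.sqrt(d)
W_down = rng.standard_normal((4 * d, d)) / np.sqrt(4 * d)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def transformer_block(x):
    """One synchronous update of every position: self-attention plus a ReLU MLP."""
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    scores = q @ k.T / np.sqrt(d)
    # Causal mask: each position attends only to itself and earlier positions.
    mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))
    scores = np.where(mask, scores, -np.inf)
    x = x + softmax(scores) @ v                   # attention residual
    x = x + np.maximum(x @ W_up, 0.0) @ W_down    # the "rectified linear" neurons
    return x

tokens = rng.standard_normal((seq_len, d))   # embeddings of a fixed input, not self-generated experience
print(transformer_block(tokens).shape)       # (8, 16)
```

Nothing here is recurrent, local, or self-paced; the whole sequence is transformed in one synchronous sweep, which is the point of the contrast with spiking, exploring brains.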