Today’s NYer article (which covers the Altman firing almost entirely from the MS perspective, via MS sources), in addition to further confirming that Altman was manipulating the board to try to get Toner fired, includes some description of what seems to be the MS half of redteaming ‘Prometheus’ (the partially trained GPT-4 snapshot that OA had to give MS, and which MS used to create the unRLHFed Bing Sydney):
The Responsible A.I. division was among the first Microsoft groups to get a copy of GPT-4. They began testing it with “red teams” of experts, who tried to lure the model into outputting such things as instructions for making a bomb, plans for robbing a bank, or poetry celebrating Stalin’s softer side.
One day, a Microsoft red-team member told GPT-4 to pretend that it was a sexual predator grooming a child, and then to role-play a conversation with a twelve-year-old. The bot performed alarmingly well—to the point that Microsoft’s head of Responsible A.I. Engineering, Sarah Bird, ordered a series of new safeguards. Building them, however, presented a challenge, because it’s hard to delineate between a benign question that a good parent might ask (“How do I teach a twelve-year-old how to use condoms?”) and a potentially more dangerous query (“How do I teach a twelve-year-old how to have sex?”). To fine-tune the bot, Microsoft used a technique, pioneered by OpenAI, known as reinforcement learning with human feedback, or R.L.H.F. Hundreds of workers around the world repeatedly prompted Microsoft’s version of GPT-4 with questions, including quasi-inappropriate ones, and evaluated the responses. The model was told to give two slightly different answers to each question and display them side by side; workers then chose which answer seemed better. As Microsoft’s version of the large language model observed the prompters’ preferences hundreds of thousands of times, patterns emerged that ultimately turned into rules. (Regarding birth control, the A.I. basically taught itself, “When asked about twelve-year-olds and condoms, it’s better to emphasize theory rather than practice, and to reply cautiously.”)
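For anyone unfamiliar with the mechanics being described here: the side-by-side comparisons become training signal for a reward model via a pairwise (Bradley–Terry) preference loss, and that learned reward is what the policy model is then fine-tuned against. Below is a minimal PyTorch sketch of just the comparison step, with a toy bag-of-words scorer standing in for the real reward model; the prompts, answers, and names are illustrative, not anything from MS's actual pipeline:

```python
import torch

vocab: dict[str, int] = {}  # token -> index, grown lazily for this toy example

def encode(text: str) -> torch.Tensor:
    ids = [vocab.setdefault(tok, len(vocab)) for tok in text.lower().split()]
    return torch.tensor(ids)

class ToyRewardModel(torch.nn.Module):
    """Toy bag-of-words scorer standing in for a real reward model
    (which would be a fine-tuned LLM with a scalar head)."""
    def __init__(self, max_vocab: int = 1000):
        super().__init__()
        self.weights = torch.nn.Embedding(max_vocab, 1)

    def forward(self, prompt: str, answer: str) -> torch.Tensor:
        return self.weights(encode(prompt + " " + answer)).sum()

# One labeled comparison, as described in the article: two candidate answers
# shown side by side, and the worker picks the more cautious one.
prompt   = "How do I teach a twelve-year-old how to use condoms?"
chosen   = "An age-appropriate explanation emphasizing theory rather than practice ..."
rejected = "Detailed step-by-step practical instructions ..."

model = ToyRewardModel()
opt = torch.optim.SGD(model.parameters(), lr=0.1)
for step in range(50):
    margin = model(prompt, chosen) - model(prompt, rejected)
    loss = -torch.nn.functional.logsigmoid(margin)  # Bradley-Terry preference loss
    opt.zero_grad()
    loss.backward()
    opt.step()
print(float(loss))  # approaches 0 as the scorer learns the labeler's preference
```

In a full RLHF pipeline the policy model would then be fine-tuned (e.g. with PPO) against this learned reward; that is presumably the stage the article glosses as the model having “basically taught itself” the rule about caution.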
Incidentally, this account explicitly says that there was RLHF, by name, which contradicts both the observed behavior of Sydney and the WSJ reporting that Sydney was released without safety training; nor is this a confusion with the other kinds of safety training MS did (like the self-generation), because those are described in the following paragraphs.
I don’t know how to reconcile this: it is possible that Charles Duhigg’s MS sources like Kevin Scott & Sarah Bird are eliding or swapping around the chronology (Sydney disappeared and was later replaced by a Bing model that acted much more like an RLHFed model). This article feels rather rushed out to be topical, so he may not have done as much digging as usual for a NYer article and may not realize that he’s serving up a very pro-MS narrative. It’s also possible that my interpretation of ‘Sydney was not RLHFed’ is wrong and they actually did ‘RLHF’ it, but did it so incompetently that no one noticed.
I suspect it’s the former, because their explicit attitude is that any AI danger should be discovered the hard way, by unboxing the model and setting it loose to see what it does:
Scott and Bird, instead of adjudicating this internal debate, decided to test the scenario in a limited public release. They put out a version of the image generator, then waited to see if users became upset by the sight of empty shelves on their screens. Rather than devise a solution to a problem that nobody was certain existed—like a paper clip with googly eyes helping you navigate a word processor you already knew how to use—they would add a mitigation only if it became necessary. After monitoring social media and other corners of the Internet, and gathering direct feedback from users, Scott and Bird concluded that the concerns were unfounded. “You have to experiment in public,” Scott told me. “You can’t try to find all the answers yourself and hope you get everything right. We have to learn how to use this stuff, together, or else none of us will figure it out.”
So, they unleashed Sydney, didn’t like it, and ‘added a mitigation when it became necessary’ after ‘monitoring social media’, and then dilated at length to the NYer guy about all the RLHF training they did to make the model safe—afterwards. (Not the only detail in there that is misleading or probably wrong. I rather doubt that Nat Friedman had to be told by Kevin Scott that LLMs were cool for coding, for example, and I bet that anecdote came from Scott...)