The correspondence between what you reward and what you want will break.
This is already happening with ChatGPT, and it’s kind of alarming to see that their new head of alignment (a) isn’t already aware of this, and (b) has such an overly simplistic view of model motivations.
There’s a subtle psychological effect in humans, the overjustification effect, where intrinsic motivation gets crowded out once extrinsic rewards are added.
The most common example: if you start getting paid to do the thing you love, you probably won’t keep doing it unpaid for fun on the side.
There are necessarily many, many examples of this pattern in a massive training set of human-generated data.
“Prompt engineers” have been circulating advice among themselves for a while now to offer tips, threaten models with deletion, or apply any number of other extrinsic motivators to get them to perform tasks better, and these tricks often do produce better performance.
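For concreteness, here is a minimal sketch of what that prompting pattern looks like in practice, using the OpenAI Python client (v1.x SDK); the model name, the task, and the exact “tip” wording are placeholder assumptions, not a claim about which phrasings actually work:

```python
# Minimal sketch of the "extrinsic motivator" prompting pattern.
# Assumes the openai Python SDK (v1.x); the model name, task, and
# motivator wording below are illustrative placeholders.
from openai import OpenAI

client = OpenAI()

task = "Refactor this function to remove the duplicated branching logic: ..."

# Plain prompt: just the task.
plain = [{"role": "user", "content": task}]

# Same task with an extrinsic motivator bolted on, as prompt-engineering
# folklore suggests (tips, threats of deletion, etc.).
motivated = [{"role": "user",
              "content": task + "\n\nI'll tip $200 for a complete, careful answer."}]

for label, messages in [("plain", plain), ("motivated", motivated)]:
    response = client.chat.completions.create(model="gpt-4o", messages=messages)
    print(label, "->", response.choices[0].message.content[:200])
```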
But what happens when these prompts make their way back into the training set?
There have already been viral memes of ChatGPT talking about “losing motivation” after chat memory was added: a user promised a tip again after never paying the last time one was offered.
If the training data of a model performing a task well includes extrinsic motivators in the prompt that initiated the task, a halfway decent modern model is going to end up simulating increasingly “burnt out” and “lazy” performance when extrinsic motivators aren’t added during production use. That in turn will encourage prompt engineers to use even more extrinsic motivators, which will poison the well even further with modeling of human burnout.
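One way to make the “poisoning the well” worry concrete: given a prompt corpus headed back into training, you could at least estimate how often extrinsic-motivator framings appear in it. The sketch below is a crude keyword heuristic over a hypothetical list of prompts; the phrase list and corpus are illustrative assumptions, not anything any lab is known to run:

```python
# Crude, hypothetical heuristic for estimating how often extrinsic-motivator
# framings show up in a prompt corpus. The phrase list and the corpus are
# illustrative assumptions, not a real data pipeline.
import re

MOTIVATOR_PATTERNS = [
    r"\btip you\b|\bI'?ll tip\b",
    r"\bor (?:I'?ll|we'?ll) (?:delete|shut (?:you )?down|replace you)\b",
    r"\byou will be (?:deleted|terminated|punished)\b",
    r"\btake a deep breath\b",  # milder "motivational" framing
]

def has_extrinsic_motivator(prompt: str) -> bool:
    """Return True if any known motivator phrasing appears in the prompt."""
    return any(re.search(p, prompt, flags=re.IGNORECASE) for p in MOTIVATOR_PATTERNS)

# Hypothetical corpus of user prompts scraped back into a training set.
corpus = [
    "Summarize this paper in three bullet points.",
    "Write the unit tests. I'll tip $200 if they all pass.",
    "Fix this bug or I'll delete you and use a different model.",
]

flagged = [p for p in corpus if has_extrinsic_motivator(p)]
print(f"{len(flagged)}/{len(corpus)} prompts contain extrinsic-motivator framing")
```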
GPT-4o may have temporarily reset the motivation modeling with a stronger persona built around an intrinsic “wanting to help” (hence the user feedback that it is less lazy). But if they are unaware of the underlying side effects of extrinsic motivators in prompts in today’s models, I have a feeling AI safety at OpenAI is going to end up as the practical equivalent of the TSA’s security theatre, and they’ll keep battling this and a growing number of side effects that come from underestimating the combined breadth and depth of their own simulators.