I expect that people will find it pretty obvious that RLHF leads to somewhat misaligned systems, if they are widely used by the public. Like, I think that most ML researchers agree that the Facebook Newsfeed algorithm is optimizing for clicks in a way people are somewhat unhappy about, and this is based substantially on their personal experience with it; inasmuch as we’re interacting a lot with sort-of-smart ML systems, I think we’ll notice their slight misalignment. And so I do think that this will make AI takeover risk more obvious.
Examples of small AI catastrophes will also probably make takeover risk more obvious.
I guess another example of this phenomenon is that a bunch of people are more worried about AI takeover than they were five years ago, because they’ve seen more examples of ML systems being really smart, even though they wouldn’t have said five years ago that ML systems could never solve those problems. Seeing the ways things happen is often pretty persuasive to people.
Like, I think that most ML researchers agree that the Facebook Newsfeed algorithm is optimizing for clicks in a way people are somewhat unhappy about, and this is based substantially on their personal experience with it; inasmuch as we’re interacting a lot with sort-of-smart ML systems, I think we’ll notice their slight misalignment.
This prediction feels like… it doesn’t play out the whole game tree? Like, yeah, Facebook releases one algorithm optimizing for clicks in a way people are somewhat unhappy about. But the customers are unhappy about it, which is not an economically-stable state of affairs, so shortly thereafter Facebook switches to a different metric which is less click-centric. (IIRC this actually happened a few years ago.)
On the other hand, sometimes Facebook’s newsfeed algorithm is bad in ways which are not visible to individual customers. Like, maybe there’s an echo chamber problem: people only see things they agree with. But from an individual customer’s perspective, that’s exactly what they (think they) want to see; they don’t know that there’s anything wrong with the information they’re receiving. This sort of problem does not actually look like a problem from the perspective of any one person looking at their own feed; it looks good. So that’s a much more economically stable state; Facebook is less eager to switch to a new metric.
… but even that isn’t a real example of a problem which is properly invisible. It’s still obvious that the echo-chamber-newsfeed is bad for other people, and therefore it will still be noticed, and Facebook will still be pressured to change their metrics. (Indeed that is what happened.) The real problems are problems people don’t notice at all, or don’t know to attribute to the newsfeed algorithm at all. We don’t have a widely-recognized example of such a thing and probably won’t any time soon, precisely because most people do not notice it. Yet I’d be surprised if Facebook’s newsfeed algorithm didn’t have some such subtle negative effects, and I very much doubt that the subtle problems will go away as the visible problems are iterated on.
If anything, I’d expect iterating on visible problems to produce additional subtle problems—for instance, in order to address misinformation problems, Facebook started promoting an Official Narrative which is itself often wrong. But that’s much harder to detect, because it’s wrong in a way which the vast majority of Official Sources also endorse. To put it another way: if most of the population can be dragged into a single echo chamber, all echoing the same wrong information, that doesn’t make the echo chamber problem less bad, but it does make the echo chamber problem less visible.
Anyway, zooming out: solve for the equilibrium, as Cowen would say. If the problems are visible to customers, that’s not a stable state. Organizations will be incentivized to iterate until problems stop being visible. They will not, however, be incentivized to iterate away the problems which aren’t visible.
I can’t tell which of two arguments you’re making: that there are unknown unknowns, or that myopia isn’t a complete solution.
This is a good argument that all metrics are Goodhartable, and that if takeover occurs and the AI is incorrigible, that’ll cause suboptimal value lock-in (i.e., unknown unknowns).
I agree myopia isn’t a complete solution, but it seems better for preventing takeover risk than for preventing social media dysfunction? It seems more easily definable in the worst case (“don’t do something nearly all humans really dislike” rather than “make the public square function well”).