I am certainly not an expert, but what I have read in the field of AI alignment is aimed almost solely at the question of how to make sure that a general AI does not start optimizing for outcomes that humans do not want.
This is, to me, an obviously important problem: it seems like we only get one chance at it, and if we fail we are all Doomed.
However, it also seems like a very difficult problem to get good feedback on, since there are no AGIs to test different strategies on, which makes it hard to estimate the promise of different avenues of research.
A related problem in which real-world feedback can be more readily obtained might be the alignment of narrow AIs. It seems to me that existing AIs already demonstrate behavior that appears misaligned with the intentions of their programmers. For example, websites such as Facebook and YouTube employ algorithms that personalize the content recommended to a user in order to keep them engaged and thus generate more ad revenue for the companies. On the face of it, there seems to be nothing wrong with optimizing to show people what they want to see. Unfortunately, a very effective strategy for accomplishing this goal in many cases appears to be showing people extremely misleading or fabricated stories about reality that both confirm what the person already believes and provoke a large emotional response. In this way these platforms support the spread of misinformation and political polarization, which was almost certainly not the intent of the programmers, who probably imagined themselves to be creating a service uniquely tailored to be enjoyable for everyone.
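To make that failure mode concrete, here is a minimal, hypothetical sketch of an engagement-only ranker. The Item fields, the weights, and the toy engagement model are my own illustrative assumptions, not any platform's actual system; the point is only that an objective which never looks at accuracy can end up rewarding misleading content.

```python
from dataclasses import dataclass

@dataclass
class Item:
    title: str
    relevance: float         # how well the item matches the user's interests (0-1)
    emotional_charge: float  # how strongly the item provokes a reaction (0-1)
    accuracy: float          # how factually grounded the item is (0-1)

def predicted_engagement(item: Item) -> float:
    # Toy model: engagement rises with relevance and emotional charge;
    # the objective never consults accuracy at all.
    return 0.5 * item.relevance + 0.5 * item.emotional_charge

candidates = [
    Item("Balanced explainer on a policy debate", relevance=0.8, emotional_charge=0.2, accuracy=0.9),
    Item("Outrage-bait story confirming your views", relevance=0.8, emotional_charge=0.9, accuracy=0.2),
]

# Ranking purely by engagement surfaces the misleading item first, even though
# nobody who chose the metric intended to reward inaccuracy.
for item in sorted(candidates, key=predicted_engagement, reverse=True):
    print(f"{predicted_engagement(item):.2f}  {item.title}")
```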
This problem does not seem to be something that can simply be patched, much in the way deficient friendly-AI designs do not seem patchable. When I try to consider a possible solution, it seems impossible, though not as impossible as aligned AGI. The key immediate problems in my mind are “how do you decide what counts as misinformation in a relatively unbiased manner?” and “if you don’t fulfill people’s desire for confirmation of their poorly supported beliefs, how do you make your platform as attractive as a competitor who will scratch that confirmation-bias itch?”
Tackling the challenge of “build an AI system that recommends content to users without spreading misinformation”, or something similar, seems like it could provide a reasonably relevant test bed for ideas about the outer alignment of AI systems. There are clearly a number of relevant differences between this kind of test and actual AGI, but the fact that fields without concrete ways to test their ideas have historically not done well seems like a very important consideration.
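As a hedged illustration of how such a test problem might be posed, here is one toy objective: engagement minus a weighted penalty from a hypothetical misinformation score. The function name, the penalty form, and the weight are assumptions for illustration only; producing a trustworthy misinfo_score and choosing the weight without losing users are exactly the open questions raised above.

```python
def penalized_score(engagement: float, misinfo_score: float, lam: float = 2.0) -> float:
    """Toy objective: predicted engagement minus a weighted misinformation penalty.

    Both inputs are assumed to lie in [0, 1]; lam controls how harshly
    suspected misinformation is punished relative to engagement.
    """
    return engagement - lam * misinfo_score

# A highly engaging but likely-false item loses to a duller but accurate one.
print(f"{penalized_score(engagement=0.9, misinfo_score=0.6):+.2f}")  # -0.30
print(f"{penalized_score(engagement=0.6, misinfo_score=0.0):+.2f}")  # +0.60
```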
So: are there groups doing this sort of research, and if not, is there a good reason why such research would not be particularly useful for the AI alignment field?
A selection of stuff just from the Alignment Newsletter (which probably misses most of the work):
From Optimizing Engagement to Measuring Value
Aligning AI to Human Values means Picking the Right Metrics
Designing Recommender Systems to Depolarize
What are you optimizing for? Aligning Recommender Systems with Human Values
Aligning Recommender Systems as Cause Area (not in the newsletter since I don’t find it persuasive; I wrote about why in a comment on the post)
Learning to Summarize with Human Feedback (and its predecessor, Fine-Tuning GPT-2 from Human Preferences)
Of course, tons of this research is going on. Do you think people who work at Facebook or YouTube are happy that their algorithms suggest outrageous or misleading content? I know a bit about the work at YouTube (I work on an unrelated applied AI team at Google), and they are altering metrics, penalizing certain kinds of content, looking for user journeys that appear to have undesirable outcomes and figuring out what to do about them, and so on.
I’m also friends with a political science professor who consults with Facebook on similar kinds of issues, basically applying mechanism design to think about how people will act given different kinds of things in their feeds.
You can also think about spam or abuse of AI systems, which show similar patterns. If someone figures out how to trick the quality rating system for ads into thinking their ad is high quality, they’ll get a discount (this is how, e.g., Google search ads pricing works). All kinds of tricky things happen: web pages showing one thing to the ad system and a different thing to the user, for instance, or selling one thing to harvest emails to spam for something else.
In general, the observation from working in the field is that if you have a simple metric, people will figure out how to game it. So you need to build in a lot of safeguards, and you need to keep evolving as the spammers and abusers evolve. There’s no end point, no place where you think you’re done, just an ever-changing competition.
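Here is a toy sketch of that cat-and-mouse dynamic, under made-up names and rules (the spam phrases, the cloaking logic, and the safeguard are all illustrative assumptions, not how any real ad or search system works): a simple quality metric gets gamed by cloaking, a safeguard catches that particular trick, and the cycle continues with the next one.

```python
def quality_score(page_text: str) -> float:
    """Crude stand-in for a quality rater: penalize obvious spam phrases."""
    spam_phrases = ["free money", "click here now"]
    hits = sum(phrase in page_text.lower() for phrase in spam_phrases)
    return max(0.0, 1.0 - 0.5 * hits)

def served_page(cloaking: bool, viewer: str) -> str:
    """A cloaking site shows the rater a clean page and real users the spammy one."""
    clean = "A helpful article about personal finance."
    spammy = "FREE MONEY!!! Click here now to claim your reward."
    if cloaking and viewer == "rater":
        return clean
    return spammy if cloaking else clean

# The metric alone is fooled: the rater sees the clean page and scores it 1.0,
# while real users are served the page that would have scored 0.0.
print(quality_score(served_page(cloaking=True, viewer="rater")))  # 1.0
print(quality_score(served_page(cloaking=True, viewer="user")))   # 0.0

# One safeguard: occasionally compare the rater's view with a real user's view.
def cloaking_suspected(cloaking: bool) -> bool:
    return served_page(cloaking, "rater") != served_page(cloaking, "user")

print(cloaking_suspected(True))  # True -- this trick is caught,
# but the next trick will need a different safeguard; there is no end point.
```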
I’m not sure this provides much comfort for the AGI alignment folks....
That’s what I was trying to point at with regard to the problem not being patchable. It doesn’t seem like there is some simple patch you can write and then be done. A solution that would work more permanently seems to have some of the “impossible” character of AGI alignment, and trying to solve it at that level seems like it could be valuable for AGI alignment researchers.
Yeah. Just off the top of my head, OpenAI’s safety group has put some work into language models (e.g.), and there’s a new group called Preamble working on helping recommender systems meet certain desiderata.
Also see the most recent post on LW from Beth Barnes.