A possible way to convert money to progress on alignment: offering a large (recurring) prize for the most interesting failures found in the behavior of any (sufficiently-advanced) model. Right now I think it’s very hard to find failures which will actually cause big real-world harms, but you might find failures in ways that uncover useful methodologies for the future, or at least train a bunch of people to get much better at red-teaming.
(For existing models, it might be more productive to ask for “surprising behavior” rather than “failures” per se, since I think almost all current failures are relatively uninteresting. Idk how to avoid inspiring capabilities work, though… but maybe understanding models better is robustly good enough to outweigh that?)
I like this. Would this have to be limited to publicly available models? Seems kind of hard to do for private models.
What kind of access might be needed to private models? Could there be a secure multi-party computation approach that is sufficient?
Ideas for defining “surprising”? If we’re trying to create a real incentive, people will want to understand the resolution criteria.
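One candidate operationalization (just a sketch, not something proposed in the thread): score a submitted transcript by how unlikely a fixed reference model finds the observed behavior, i.e. its per-token surprisal. The model name and function below are illustrative assumptions, not an actual resolution criterion.

```python
# Hypothetical "surprisingness" metric: rank prize submissions by how
# unlikely a fixed reference model finds the submitted transcript.
# The reference model ("gpt2" here) is an illustrative placeholder.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def surprisal_per_token(model_name: str, text: str) -> float:
    """Mean negative log-likelihood (nats/token) of `text` under a reference model."""
    tok = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    model.eval()
    ids = tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        # Passing labels=input_ids makes the model return mean NLL per token.
        loss = model(ids, labels=ids).loss
    return loss.item()

# A submission might then be scored relative to a matched baseline transcript:
# score = surprisal_per_token("gpt2", failure_text) - surprisal_per_token("gpt2", baseline_text)
```

This obviously has problems as a real criterion (it rewards weird text rather than weird *behavior*, and can be gamed), but something in this family might work as one input to a judging panel.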