I think you make good points generally about status motives and obstacles for breakers. As counterpoints, I would offer:
Eliezer is a good example of someone who built a lot of status on the back of “breaking” others’ unworkable alignment strategies. I found the AI Box experiments especially enlightening in my early days.
There are lots of high-status breakers, and lots of independent status-rewarding communities around the security world. Some of these are whitehat/ethical, like leaderboards for various bug bounty programs, OWASP, etc. Others are less so, like Blackhat/DEFCON in the early days, criminal enterprises, etc.
Perhaps here is another opportunity to learn lessons from the security community about what makes a good reward system for the breaker mentality. My personal feeling is that poking holes in alignment strategies is easier than coming up with good ones, but I’m also aware that thinking that breaking is easy is probably committing some quantity of typical mind fallacy. Thinking about how things break, or how to break them intentionally, is probably a skill that needs a lot more training in alignment. Or at least we need a way to attract skilled breakers to alignment problems.
I find it to be a very natural fit to post bounties on various alignment proposals to attract breakers to them. Keep upping the bounty, and eventually you have a quite strong signal that a proposal might be workable. I notice your experience of offering a personal bounty does not support this, but I think there is a qualitative difference between a bounty leaderboard with public recognition and a large pipeline of value that can be harvested by a community of good breakers, and what may appear to be a one-off deal offered by a single individual with unclear ancillary status rewards.
It may be viable to simply partner with existing crowdsourced bounty program providers (e.g. BugCrowd) to offer a new category of bounty. Traditionally, these services have focused on “pen-test” type bounties, doing runtime testing of existing live applications. But I’ve long been saying there should be a market for crowdsourced static analysis, and even design reviews, with a pay-per-flaw model.
Eliezer is a good example of someone who built a lot of status on the back of “breaking” others’ unworkable alignment strategies. I found the AI Box experiments especially enlightening in my early days.
Fair enough.
My personal feeling is that poking holes in alignment strategies is easier than coming up with good ones, but I’m also aware that thinking that breaking is easy is probably committing some quantity of typical mind fallacy.
Yeah, personally, building feels more natural to me.
I agree a leaderboard would be great. I think it’d be cool to have a leaderboard for proposals as well—“this proposal has been unbroken for X days” seems like really valuable information that’s not currently being collected.
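To make the “unbroken for X days” idea concrete, here is a minimal sketch of what such a leaderboard might track. Everything in it is hypothetical: the `Proposal` record, the `breaks` list of dates on which a break was accepted, and the ranking rule are assumptions for illustration, not a description of any existing system.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List


@dataclass
class Proposal:
    """Hypothetical record for one alignment proposal on the leaderboard."""
    name: str
    posted: date
    # Dates on which a break was accepted (assumed to be on or after `posted`).
    breaks: List[date] = field(default_factory=list)

    def days_unbroken(self, today: date) -> int:
        # The clock restarts at the most recent accepted break;
        # if there has been none, it runs from the posting date.
        last_reset = max(self.breaks, default=self.posted)
        return (today - last_reset).days


def leaderboard(proposals: List[Proposal], today: date) -> List[Proposal]:
    """Rank proposals by how long they have survived without a break."""
    return sorted(proposals, key=lambda p: p.days_unbroken(today), reverse=True)


if __name__ == "__main__":
    today = date(2022, 3, 1)
    entries = [
        Proposal("A", posted=date(2022, 1, 1)),
        Proposal("B", posted=date(2022, 1, 10), breaks=[date(2022, 2, 1)]),
    ]
    for p in leaderboard(entries, today):
        print(p.name, p.days_unbroken(today))
```

A real version would also need some way to decide what counts as an accepted break, which is probably the hard part.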
I don’t think I personally have enough clout to muster the coordination necessary for a tournament or leaderboard, but you probably do. One challenge is that different proposals are likely to assume different sorts of available capabilities. I have a hunch that many disagreements which appear to be about alignment are actually about capabilities.
In the absence of coordination, I think if someone like you were to simply start advertising themselves as an “uberbreaker” who can shoot holes in any proposal, and over time give reports on which proposals seem the strongest, that could be really valuable and status-rewarding. Sort of a “pre-Eliezer” person who I can run my ideas by in a lower-stakes context, as opposed to saying “Hey Eliezer, I solved alignment. Wallop me if I’m wrong!”
I appreciate the nudge here to put some of this into action. I hear alarm bells when thinking about formalizing a centralized location for AI safety proposals and information about how they break, but my rough intuition is that if there is a way these can be scrubbed of descriptions of capabilities which could be used irresponsibly to bootstrap AGI, then this is a net positive. At the very least, we should be scrambling to discuss safety controls for already public ML paradigms, in case any of these are just one key insight or a few teraflops away from being world-ending.
I would like to hear from others about this topic, though; I’m very wary of being at fault for accelerating the doom of humanity.