Thanks for writing this! Do you have any thoughts on doing a red team/blue team alignment tournament as described here?
Many! Thanks for sharing. This could easily turn into its own post.
In general, I think this is a great idea. I’m somewhat skeptical that this format would generate deep insights; in my experience, successful Capture the Flag / wargame / tabletop exercises work best when each group spends a lot of time preparing for its particular role, but opsec wargames are usually easier to score, so the judge role makes less sense there. That said, in the alignment world I’m generally supportive of trying as many different approaches as possible to see what works best.
Prior to reading your post, my general thoughts about how these kinds of adversarial exercises relate to the alignment world were these:
The industry thought leaders usually have experience as both builders and breakers; some insights are hard to gain from just one side of the battlefield. That said, the industry benefits from folks who spend the time becoming highly specialized in one role or the other, and the breaker role should be valued at least as highly as, if not more than, the builder role. (In the case of alignment, breakers may be the only source of failure data we can safely get.)
The most valuable tabletop exercises that I was a part of spent at least as much time analyzing the learnings as on the exercise itself; almost everyone involved will have unique insights that aren’t noticed by others. (Perhaps this points to the idea of having multiple ‘judges’ in an alignment tournament.)
Non-experts often have insights or perspectives that are surprising to security professionals; I’ve been able to improve an incident response process based on participation from other teams (HR, legal, etc.) almost every time I’ve run a tabletop. This is probably less true for an alignment war game, because the background knowledge required to even understand most alignment topics is so vast and specialized.
Unknown unknowns are a hard problem. While I think we are a long way away from having builder ideas that aren’t easily broken, it’s going to be a significant danger to have breakers run out of exploit ideas and mistake that for a win for the builders.
Most tabletop exercises are focused on realtime response to threats. Builder/breaker war games like the DEFCON CTF are also realtime. It might be a challenge to create a similarly engaging format that allows for longer deliberation times on these harder problems, but it’s probably a worthwhile one.
Thanks for the reply!
As some background on my thinking here, last I checked there are a lot of people on the periphery of the alignment community who have some proposal or another they’re working on, and they’ve generally found it really difficult to get quality critical feedback. (This is based on an email I remember reading from a community organizer a year or two ago saying “there is a desperate need for critical feedback”.)
I’d put myself in this category as well—I used to write a lot of posts and especially comments here on LW summarizing how I’d go about solving some aspect or another of the alignment problem, hoping that Cunningham’s Law would trigger someone to point out a flaw in my approach. (In some cases I’d already have a flaw in mind along with a way to address it, but I figured it’d be more motivating to wait until someone mentioned a particular flaw in the simple version of the proposal before I mentioned the fix for it.)
Anyway, it seemed like people often didn’t take the bait. (Thanks to everyone who did!) Even with offering $1000 to change my view, as I’m doing in my LW user profile now, I’ve had 0 takers. I don’t post on LW/AF nearly as much anymore, partially because it has seemed more efficient to try to shoot holes in my ideas myself. On priors, I wouldn’t have expected this to be true—I’d expect that someone else is going to be better at finding flaws in my ideas than I am myself, because they’ll have a different way of looking at things which could address my blind spots.
Lately I’ve developed a theory for what’s going on. You might be familiar with the idea that humans are often subconsciously motivated by the need to acquire & defend social status. My theory is that there’s an asymmetry in the motivations for alignment building & breaking work. The builder has an obvious status motive: If you become the person who “solved AI alignment”, that’ll be really good for your social status. That causes builders to have status-motivated blind spots around weak points in their ideas. However, the breaker doesn’t have an obvious status motive. In fact, if you go around shooting down people’s ideas, that’s liable to annoy them, which may hurt your social status. And since most proposals are allegedly easily broken anyway, you aren’t signaling any kind of special talent by shooting them down. Hence the “breaker” role ends up being undervalued/disincentivized. Especially when it comes to doing anything beyond just saying “that won’t work”—finding a breaker who will describe a failure in detail instead of just vaguely gesturing seems really hard. (I don’t always find such handwaving persuasive.)
I think this might be why Eliezer feels so overworked. He’s staked a lot of reputation on the idea that AI alignment is a super hard problem. That gives him a unique status motive to play the red team role, which is why he’s had a hard time replacing himself. I think maybe he’s tried to compensate for this by making it low status to make a bad proposal, in order to browbeat people into self-critiquing their proposals. But this has a downside of discouraging the sharing of proposals in general, since it’s hard to predict how others will receive your ideas. And punishments tend to be bad for creativity.
So yeah, I don’t know if the tournament idea would have the immediate effect of generating deep insights. But it might motivate people to share their ideas, or generate better feedback loops, or better align overall status motives in the field, or generate a “useless” blacklist which leads to a deep insight, or filter through a large number of proposals to find the strongest ones. If tournaments were run on a quarterly basis, people could learn lessons, generate some deep ideas from those lessons, and spend a lot of time preparing for the next tournament.
A few other thoughts...
Regarding the danger of breakers running out of exploit ideas and mistaking that for a win for the builders: perhaps we could mitigate this by allowing breakers to just characterize how something might fail in vague terms—obviously not as good as a specific description, but it still provides some signal to iterate on.
On the challenge of creating an engaging format that allows for longer deliberation times: I think something like a realtime Slack discussion could be pretty engaging. I think there is room for both high-deliberation and low-deliberation formats. [EDIT: You could also have a format in between, where the blue team gets little time, and the red team gets lots of time, to try to simulate the difference in intelligence between an AGI and its human operators.] Also, I’d expect even a slow, high-deliberation tournament format to be more engaging than the way alignment research often gets done (spend a bunch of time thinking on your own, write a post, observe the post’s score, hopefully get a few good comments, discussion dies out as the post gets old).
I think you make good points generally about status motives and obstacles for breakers. As counterpoints, I would offer:
Eliezer is a good example of someone who built a lot of status on the back of “breaking” others’ unworkable alignment strategies. I found the AI Box experiments especially enlightening in my early days.
There are lots of high-status breakers, and lots of independent status-rewarding communities around the security world. Some of these are whitehat/ethical, like leaderboards for various bug bounty programs, OWASP, etc. Some of them less so, like Black Hat/DEF CON in the early days, criminal enterprises, etc.
Perhaps here is another opportunity to learn lessons from the security community about what makes a good reward system for the breaker mentality. My personal feeling is that poking holes in alignment strategies is easier than coming up with good ones, but I’m also aware that thinking that breaking is easy is probably committing some quantity of typical mind fallacy. Thinking about how things break, or how to break them intentionally, is probably a skill that needs a lot more training in alignment. Or at least we need a way to attract skilled breakers to alignment problems.
I find it to be a very natural fit to post bounties on various alignment proposals to attract breakers to them. Keep upping the bounty, and eventually you have a quite strong signal that a proposal might be workable. I notice your experience of offering a personal bounty does not support this, but I think there is a qualitative difference between, on the one hand, a bounty leaderboard with public recognition and a large pipeline of value that a community of good breakers can harvest, and, on the other, what may appear to be a one-off deal offered by a single individual with unclear ancillary status rewards.
It may be viable to simply partner with existing crowdsourced bounty program providers (e.g. Bugcrowd) to offer a new category of bounty. These services have traditionally focused on “pen-test” style bounties: runtime testing of existing live applications. But I’ve long been saying there should be a market for crowdsourced static analysis, and even design reviews, with a pay-per-flaw model.
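To make the bookkeeping concrete, here is a rough sketch of what tracking an escalating bounty as a workability signal could look like. Everything here (the class, its fields, and the scoring rule) is hypothetical, not an existing system:

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Tuple


@dataclass
class ProposalBounty:
    """Escalating break-it bounty on a single alignment proposal (hypothetical schema)."""
    proposal: str
    posted_on: date
    bounty_usd: int
    broken: bool = False
    history: List[Tuple[date, int]] = field(default_factory=list)

    def raise_bounty(self, new_amount: int) -> None:
        """Record a bounty increase; a larger unclaimed bounty is a stronger signal."""
        self.history.append((date.today(), new_amount))
        self.bounty_usd = new_amount

    def mark_broken(self) -> None:
        """Called once a breaker publishes a concrete failure mode and claims the bounty."""
        self.broken = True

    def survival_signal(self) -> float:
        """Crude 'might be workable' score: unclaimed dollars times days unbroken."""
        if self.broken:
            return 0.0
        days = (date.today() - self.posted_on).days
        return self.bounty_usd * max(days, 1)
```

The point is just that “how much money has gone unclaimed, and for how long” is a number you can publish and rank on.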
Fair enough on Eliezer having built status through breaking.
Yeah, on whether breaking is easier: personally, building feels more natural to me.
I agree a leaderboard would be great. I think it’d be cool to have a leaderboard for proposals as well—“this proposal has been unbroken for X days” seems like really valuable information that’s not currently being collected.
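As a minimal sketch of what collecting that could look like (the proposal names and dates below are invented purely for illustration):

```python
from datetime import date
from typing import Optional


def days_unbroken(posted_on: date, first_break: Optional[date] = None) -> int:
    """Days a proposal survived before its first published break (or until today)."""
    end = first_break if first_break is not None else date.today()
    return (end - posted_on).days


# Hypothetical entries: (proposal, date posted, date of first successful break, if any).
proposals = [
    ("Proposal A", date(2022, 1, 10), date(2022, 1, 12)),
    ("Proposal B", date(2022, 2, 1), None),
    ("Proposal C", date(2022, 3, 15), None),
]

# Leaderboard: unbroken proposals first, longest-surviving at the top.
for name, posted, broken_on in sorted(
    proposals, key=lambda p: (p[2] is not None, -days_unbroken(p[1], p[2]))
):
    status = "broken" if broken_on else f"unbroken for {days_unbroken(posted)} days"
    print(f"{name}: {status}")
```

Even something this simple would make the strongest surviving proposals visible at a glance.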
I don’t think I personally have enough clout to muster the coordination necessary for a tournament or leaderboard, but you probably do. One challenge is that different proposals are likely to assume different sorts of available capabilities. I have a hunch that many disagreements which appear to be about alignment are actually about capabilities.
In the absence of coordination, I think if someone like you were to simply start advertising themselves as an “uberbreaker” who can shoot holes in any proposal, and over time give reports on which proposals seem the strongest, that could be really valuable and status-rewarding. Sort of a “pre-Eliezer” person who I can run my ideas by in a lower-stakes context, as opposed to saying “Hey Eliezer, I solved alignment—wallop me if I’m wrong!”
I appreciate the nudge here to put some of this into action. I hear alarm bells when thinking about formalizing a centralized location for AI safety proposals and information about how they break, but my rough intuition is that if there is a way these can be scrubbed of descriptions of capabilities which could be used irresponsibly to bootstrap AGI, then this is a net positive. At the very least, we should be scrambling to discuss safety controls for already public ML paradigms, in case any of these are just one key insight or a few teraflops away from being world-ending.
I would like to hear from others about this topic, though; I’m very wary of being at fault for accelerating the doom of humanity.
Interesting comment. I feel like I have recently experienced this phenomenon myself (that it’s hard to find people who can play “red team”).
Do you have any “blue team” ideas for alignment where you in particular would want someone to play “red team”?
I would be interested in having someone play “red team” here, but if someone were to do so in a non-trivial manner then it would probably be best to wait at least until I’ve completed Part 3 (which will take at least weeks, partly since I’m busy with my main job): https://www.lesswrong.com/posts/ZmZBataeY58anJRBb/agi-assisted-alignment-part-1-introduction
Could potentially be up for playing red team against you, in exchange for you playing red team against me (but whether I think I could have something to contribute as red team would depend on the specifics of what is proposed/discussed—e.g., I’m not familiar with the technical specifics of deep learning beyond vague descriptions).
I wrote a comment on your post with feedback.
I don’t have anything prepared for red teaming at the moment—I appreciate the offer though! Can I take advantage of it in the future? (Anyone who wants to give me critical feedback on my drafts should send me a personal message!)
Thanks for the feedback!
And yes, do feel free to send me drafts in the future if you want me to look over them. I can’t give guarantees regarding the amount or speed of feedback, but it would be my intention to try to be helpful :)
I wasn’t aware you were offering a bounty! I rarely check people’s profile pages unless I need to contact them privately, so it might be worth mentioning this at the beginning or end of posts where it might be relevant.
Fair point. I also haven’t done much posting since adding the bounty to my profile. Was thinking it might attract the attention of people reading the archives, but maybe there just aren’t many archive readers.