Part 3/4 - General takeaways
In my previous two shortform posts I've talked about some object-level belief changes about technical alignment and some meta-level thoughts about how to do research, both of which were prompted by starting in an alignment program.
Let me talk here about some takeaways from all this.
(Note: As with previous posts, this is “me writing about my thoughts and experiences in case they are useful to someone”, putting in relatively low effort. It’s a conscious decision to put these in shortform posts, where they are not shoved in everyone’s faces.)
The main point is that I now think it’s much more feasible to do useful technical AI safety work than I previously thought. This update is a result of realizing both that the action space is larger than I thought (this is a theme in the object-level post) and that I have been intimidated by the culture on LW (see the meta post).
One day I heard someone say “I thought AI alignment was about coming up with some smart shit, but it’s more like doing a bunch of kinda annoying things”. This comment stuck with me.
Let’s take a concrete example. Very recently the “Sleeper Agents” paper came out. And I think both of the following are true:
1: This work is really good.
For reasons such as: it provides actual non-zero information about safety techniques and deceptive alignment; it’s a clear demonstration of failures of safety techniques; it provides a test case for testing new alignment techniques and lays out the idea “we could come up with more test cases”.
2: The work doesn’t contain a 200 IQ godly breakthrough idea.
(Before you ask: I’m not belittling the work. See point 1 above.)
Like: There are a lot of motivations for the work. Many of them are intuitive. Many build on previous work. The setup is natural. The techniques used are standard.
The value is in stuff like combining a dozen “obvious” ideas in a suitable way, carefully designing the experiment, properly implementing the experiment, writing it down clearly and, you know, actually showing up and doing the thing.
And yep, one shouldn’t hindsight-bias oneself into thinking all of this is obvious. Clearly I myself didn’t come up with the idea starting from the null string. I still think that I could contribute to the field producing more things like that. None of the individual steps is that hard; or at least, there exist steps that are not that hard. Many of them are “people who have the competence to do the standard things do the standard things” (or, as someone would say, “do a bunch of kinda annoying things”).
I don’t think the bottleneck is “coming up with good project ideas”. I’ve heard a lot of project ideas lately. While not all of them are good, in absolute terms many of them are. Turns out that coming up with an idea takes 10 seconds or 1 hour, and then properly executing it takes 10 hours or 1 full-time-equivalent year.
So I actually think the bottleneck is more like “having people execute the tons of projects the field comes up with”, at least much more so than I previously thought.
And sure, for individual newcomers it’s not trivial to come up with good projects. Realistically one needs (or at least I needed) more than the null string. I’ll talk about this more in my final post.