edit: I took a quick look, and this looks really good! Big upvote. Definitely an impressive body of work. And the actual alignment proposal is along the lines of the ones I find most promising on the current trajectory toward AGI. I don't see a lot of references to existing alignment work, but I do see a lot of references to technical results, which is really useful and encouraging. Look at Daniel Kokotajlo's and others' work emphasizing faithful chain of thought for similar suggestions.
edit continued: I find your framing a bit odd, in starting from an unaligned, uninterpretable AGI (but one that's presumably under control). I wonder if you're thinking of something like o1, which basically does what it's told WRT answering questions/providing data but can't be considered overall aligned, and which isn't readily interpretable because we can't see its chain of thought? A brief post situating your proposal in relation to current or near-future systems would be interesting, at least to me.
Original:
Interesting. 100 pages is quite a time commitment. And you don’t reference any existing work in your brief pitch here—that often signals that people haven’t read the literature, so most of their work is redundant with existing stuff or missing big considerations that are part of the public discussion. But it seems unlikely that you’d put in 100 pages of writing without doing some serious reading as well.
Here’s what I suggest: relate this to existing work, and reduce the reading-time ask, by commenting on related posts with a link to and summary of the relevant sections of your paper.
100 pages is quite a time commitment. And you don’t reference any existing work in your brief pitch here—that often signals that people haven’t read the literature, so most of their work is redundant with existing stuff or missing big considerations that are part of the public discussion.
Seconded. Although I’m impressed by how many people have already read and commented below, I think many more people will be willing to engage with your ideas if you separate out some key parts and present them here on their own. A separate post on section 6.1 might be an excellent place to start. Once you demonstrate to other researchers that you have promising and novel ideas in some specific area, they’ll be a lot more likely to be willing to engage with your broader work.
It’s an unfortunate reality of safety research that there’s far more work coming out than anyone can fully keep up with, and a lot of people coming up with low-quality ideas that don’t take into account evidence from existing work, and so making parts of your work accessible in a lower-commitment way will make it much easier for people to find time to read it.
Thank you for your suggestions! I will read the materials you recommended and try to cite more related work.
Regarding o1, I think it is the right direction. The developers of o1 should be able to see its hidden chain of thought, which makes it interpretable to them.
I think that alignment and interpretability are not “yes” or “no” properties, but matters of degree. o1 has done a good job in terms of interpretability, but there is still room for improvement. Similarly, the first AGI to come out in the future may be only partially aligned and partially interpretable, and the approaches in this paper can then be used to improve its alignment and interpretability.