I’d like some feedback on my theory of impact for my currently chosen research path
**End goal**: Reduce x-risk from AI and risk of human disempowerment.
For x-risk:
- Solving AI alignment is very important.
- Knowing how well we're actually doing on alignment, how close we are to solving it, and how much is left also seems important.
- So does knowing how well different methods work.
- And which companies are making progress, which aren't, and which are acting like they're making progress versus actually making progress.
- Put all of this on one graph and see who is actually making the line go up.
- Also, an easy way for others to measure how good their own alignment method or idea is, so that there is an actual target and a progress bar for alignment. That seems like it would make alignment research a lot easier and improve the funding space, and the field as a whole, raising both the quality and quantity of research.
- Currently, alignment measurement is mostly a mixture of vibe checks, occasional benchmarks run on a few models, jailbreak results, etc.
- Almost all of this is done on the end models as a whole, which differ in many ways that could be driving the differences between the various "alignment measurements". A method that controls for as much as possible and purely measures the different post-training methods seems like a much better way to know how we're doing on alignment, and how to prioritize research, funding, governance, etc.
On Goodharting the Line: I will also make it modular, so that people can add their own benchmarks, and I will highlight people who red-team the different alignment benchmarks.
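To make the modular, controlled-comparison idea concrete, here is a minimal sketch of the kind of harness I have in mind, assuming a shared fixed base model, pluggable post-training methods, and a benchmark registry anyone can add to. All of the names here (`Benchmark`, `PostTrainingMethod`, `register_benchmark`, `run_eval`) are hypothetical placeholders, not an existing library.

```python
# Hypothetical sketch: a modular alignment-evaluation harness where only the
# post-training method varies and benchmarks are pluggable. Names are placeholders.
from dataclasses import dataclass
from typing import Callable, Dict, List

# A "model" here is just a callable from prompt -> response.
Model = Callable[[str], str]

@dataclass
class Benchmark:
    """A pluggable alignment benchmark: a name plus a scoring function."""
    name: str
    score: Callable[[Model], float]  # returns a score, e.g. in [0, 1]

@dataclass
class PostTrainingMethod:
    """A post-training method applied to the same fixed base model."""
    name: str
    apply: Callable[[Model], Model]

# Anyone can contribute a benchmark by registering it here.
BENCHMARK_REGISTRY: List[Benchmark] = []

def register_benchmark(benchmark: Benchmark) -> None:
    BENCHMARK_REGISTRY.append(benchmark)

def run_eval(base_model: Model,
             methods: List[PostTrainingMethod]) -> Dict[str, Dict[str, float]]:
    """Hold the base model fixed; vary only the post-training method.

    Returns {method_name: {benchmark_name: score}}, ready to plot over time.
    """
    results: Dict[str, Dict[str, float]] = {}
    for method in methods:
        tuned = method.apply(base_model)
        results[method.name] = {b.name: b.score(tuned) for b in BENCHMARK_REGISTRY}
    return results
```

The point of the fixed `base_model` is that every method starts from the same place, so score differences come from the post-training method itself rather than from everything else that differs between labs' end models.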
What is the proposed research path and its theory of impact? It's not clear from reading your note, and it generally seems too abstract to really offer any feedback on.