I think your first point basically covers why: people are worried about alignment difficulties in superhuman systems in particular (because those are the dangerous systems that can cause existential failures). I think a lot of current RLHF work is focused on providing reward signals to current systems in ways that don’t directly address the problem of “how do we reward systems whose behavior has consequences too complicated for humans to understand?”
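To make the “reward signals” point concrete, here is a minimal sketch of the pairwise-preference reward modeling commonly used in current RLHF pipelines. The model, dimensions, and dummy embeddings are illustrative assumptions, not taken from any particular codebase:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Toy reward model: scores a pre-encoded response representation with a scalar reward."""
    def __init__(self, hidden_dim: int = 768):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, response_embedding: torch.Tensor) -> torch.Tensor:
        # (batch, hidden_dim) -> (batch,) scalar rewards
        return self.scorer(response_embedding).squeeze(-1)

def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry style objective: push the reward of the human-preferred
    # response above the reward of the rejected response.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Dummy embeddings stand in for encoded (prompt, response) pairs.
model = RewardModel()
chosen, rejected = torch.randn(8, 768), torch.randn(8, 768)
loss = preference_loss(model(chosen), model(rejected))
loss.backward()
```

The objective assumes a human labeler can reliably say which of two responses is better; that assumption is exactly what breaks down when the consequences of a behavior are too complicated for humans to evaluate.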
Chris Olah wrote this topic prompt (with some feedback from me (Asya) and Nick Beckstead). We didn’t want to commit him to being responsible for this post or responding to comments on it, so we submitted this on his behalf. (I’ve changed the by-line to be more explicit about this.)
Thanks for writing this! Would “fine-tune on some downstream task and measure the accuracy on that task before and after fine-tuning” count as measuring misalignment as you’re imagining it? My sense is that there might be a bunch of existing work like that.
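For concreteness, here is one way the “accuracy before and after fine-tuning” measurement could look. The model, data loaders, and hyperparameters below are placeholder assumptions, not something specified in the post:

```python
import torch
from torch.utils.data import DataLoader

def task_accuracy(model: torch.nn.Module, loader: DataLoader, device: str = "cpu") -> float:
    """Classification accuracy of `model` on a labeled downstream-task loader."""
    model.eval()
    correct, total = 0, 0
    with torch.no_grad():
        for inputs, labels in loader:
            inputs, labels = inputs.to(device), labels.to(device)
            preds = model(inputs).argmax(dim=-1)
            correct += (preds == labels).sum().item()
            total += labels.numel()
    return correct / max(total, 1)

def finetune(model: torch.nn.Module, loader: DataLoader, epochs: int = 1,
             lr: float = 1e-4, device: str = "cpu") -> torch.nn.Module:
    """Plain supervised fine-tuning on the downstream task."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for inputs, labels in loader:
            inputs, labels = inputs.to(device), labels.to(device)
            optimizer.zero_grad()
            loss_fn(model(inputs), labels).backward()
            optimizer.step()
    return model

# acc_before = task_accuracy(model, eval_loader)
# finetune(model, train_loader)
# acc_after = task_accuracy(model, eval_loader)
# The gap between acc_before and acc_after is the "before vs. after fine-tuning" number.
```

The sketch treats the pre-fine-tuning number as the baseline and the post-fine-tuning number as the capability ceiling; whether that gap is a good proxy for misalignment is what the question above is asking.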
This RFP is an experiment for us, and we don’t yet know if we’ll be doing more of them in the future. I think we’d be open to including research directions we think are promising that apply equally well to both DL and non-DL systems; I’d be interested in hearing any particular suggestions you have.
(We’d also be happy to fund particular proposals in the research directions we’ve already listed that apply to both DL and non-DL systems, though we will be evaluating them on how well they address the DL-focused challenges we’ve presented.)
Getting feedback in the next week would be ideal; September 15th will probably be too late.
Different request for proposals!
Thanks for writing this up! Speaking just for myself, I think I agree with the majority of this, and it articulates some important parts of how I live my life that I hadn’t previously made explicit to myself.