I’m not at all confident about what people who are concerned about navigating AI well should be doing. But I feel that the current portfolio is over-indexed on work which treats “transformative AI” as a black box and tries to plan around that. I think we can and should be peering inside that box.
I’d like to better understand the plausibility of the kind of technological trajectory I’m outlining. I’d like to develop a better sense of how the different risks relate to it. And I’d like to see some plans which step through how we might successfully navigate the different phases of this technological development. I think this is a kind of zoomed-in prioritization which could help us keep our eyes on the most important balls, and which we haven’t been doing much of so far.
Agree. I think there are pretty strong reasons to believe that with a concerted effort, we can very likely (> 90% probability) build safe scaffolded LM agents capable of automating ~all human-level alignment research while also being incapable of doing non-trivial consequentialist reasoning in a single forward pass. Also (still) looking for collaborators for this related research agenda on evaluating the promise of automated alignment research.