In the context of a conversation with Balaji Srinivasan about my AI views snapshot, I asked Nate Soares what sorts of alignment results would impress him, and he said:
example thing that would be relatively impressive to me: specific, comprehensive understanding of models (with the caveat that that knowledge may lend itself more (and sooner) to capabilities before alignment). demonstrated e.g. by the ability to precisely predict the capabilities and quirks of the next generation (before running it)
i’d also still be impressed by simple theories of aimable cognition (i mostly don’t expect that sort of thing to have time to play out any more, but if someone was able to come up with one after staring at LLMs for a while, i would at least be impressed)
fwiw i don’t myself really know how to answer the question “is technical research more useful than policy research?”; like that question sounds to me like it’s generated from a place of “enough of either of these will save you” whereas my model is more like “you need both”
tho i’m more like “to get the requisite technical research, aim for uploads” at this juncture
if this was gonna be blasted outwards, i’d maybe also caveat that, while a bunch of this is a type of interpretability work, i also expect a bunch of interpretability work to strike me as fake, shallow, or far short of the bar i consider impressive/hopeful
(which is not itself supposed to be any kind of sideswipe; i applaud interpretability efforts even while thinking it’s moving too slowly etc.)