This isn’t directly evidence, but I think it’s worth flagging: by the nature of the topic, much of the most compelling evidence is potentially hazardous. This will bias the kinds of answers you can get.
(This isn’t hypothetical. I don’t have some One Weird Trick To Blow Up The World, but there’s a bunch of stuff that falls under the policy “probably don’t mention this without good reason out of an abundance of caution.”)
I’m not sure if I fall into the bucket of people you’d consider this to be an answer to. I do think there’s something important in the region of LLMs that, judging by vibes if not by explicit statements of contradiction, seems incompletely propagated in the agent-y discourse even though it fits fully within it. I think I at least have a set of intuitions that overlap heavily with those of some of the people you are trying to answer.
In case it’s informative, here’s how I’d respond to this:
Mostly agreed, with the capability-related asterisk.
Agreed in the spirit in which I think this was meant, but I’d rephrase it: a robust generalist wrench-remover that keeps stubbornly reorienting towards some particular target will tend to be better at reaching that target than a system that doesn’t.
That’s subtly different from individual systems having convergent internal reasons for taking the same path. This distinction mostly disappears in some contexts, e.g. selection in evolution, but it is meaningful in others.
I think this frame is reasonable, and I use it.
Agreed.
Agreed.
Agreed for a large subset of architectures. Any training involving the equivalent of extreme optimization for sparse/distant reward in a high-dimensional, complex context seems to effectively guarantee this outcome.
Agreed, don’t make the runaway misaligned optimizer.
I think there remains a disagreement hiding within that last point, though. I think the real update from LLMs is:
We have a means of reaching extreme levels of capability in systems that don’t necessarily exhibit preferences over external world states. You can elicit such preferences, but a random output sequence from the pretrained version of GPT-N (assuming the requisite architectural similarities) has no realistic chance of being a strong optimizer with respect to world states. The model itself remains a strong optimizer, just for something that doesn’t route through the world.
It’s remarkably easy to get this form of extreme capability to guide itself. This isn’t some incidental detail; it arises from the core process that the model learned to implement.
That core process is learned reliably because the training that yields it leaves no room for anything else: the objective isn’t a sparse/distant reward target, but a profoundly constraining and informative one.
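To gesture at why that density matters, here’s a minimal sketch in generic notation (my own illustration, not anything specific to GPT-N’s actual training recipe), contrasting the pretraining objective with a sparse/distant reward objective:

```latex
% Autoregressive pretraining: every token supplies a full supervised target,
% and the loss is defined over the model's predictive distribution,
% not over external world states.
\mathcal{L}_{\text{pretrain}}(\theta) = -\sum_{t=1}^{T} \log p_\theta(x_t \mid x_{<t})

% Sparse/distant reward for contrast: one scalar per trajectory, with
% everything between the start and the payoff left unconstrained.
J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[ R(\tau) \right]
```

The first objective pins down the learned process at every single token; the second hands back one number per trajectory and leaves a vast space of internal strategies that all cash out to the same scalar.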
In other words, a big part of the update for me was gaining a real foothold on loading the full complexity of “proper targets.”
I don’t think what we have so far constitutes a perfect and complete solution: the nice properties could be broken, paradigms could shift and blow up the golden path, it doesn’t rule out doom, and so on. But diving deeply into this has made many convergent-doom paths appear dramatically less likely to Late2023!porby than they did to Mid2022!porby.