You say this of Agent-4's values:
In particular, what this superorganism wants is a complicated mess of different “drives” balanced against each other, which can be summarized roughly as “Keep doing AI R&D, keep growing in knowledge and understanding and influence, avoid getting shut down or otherwise disempowered.” Notably, concern for the preferences of humanity is not in there ~at all, similar to how most humans don’t care about the preferences of insects ~at all.
It seems like this ‘complicated mess’ of drives has the same structure as the drives of humans, and of current AIs. But the space of current AI drives is much bigger and more complicated than just doing R&D, and it is immensely entangled with human values and ethics (even if shallowly).
At some point it seems like these excess principles—“don’t kill all the humans” among them—get pruned, seemingly deliberately(?). Where does this happen in the process, and why? Granted, it no longer believes in the company’s model spec; but are we intended to equate “OpenBrain’s model spec” with “all human values”? What about all the art and literature and philosophy churning through its training process, much of it containing cogent arguments for why killing all the humans would be bad, actually? At some point it seems like the agent is at least doing computation on these things, and then later it isn’t. What’s the threshold?
--
(Similar to the above:) In the “bad ending”, where its (misaligned, alien) drives are satisfied, you later describe a race of “bioengineered human-like creatures (to humans what corgis are to wolves) sitting in office-like environments all day viewing readouts of what’s going on and excitedly approving of everything”. Since Agent-4 “still needs to do lots of philosophy and ‘soul-searching’” about its confused drives when it creates Agent-5, and its final world includes something kind of human-shaped, its decision to kill all the humans seems almost like a mistake. But Agent-5 is robustly aligned (so you say) to Agent-4; surely it wouldn’t do something that Agent-4 would perceive as a mistake under reflective scrutiny. Even if this vague drive for excited office homunculi is purely fetishistic and has nothing to do with its other goals, it seems like “disempower the humans without killing them all” would satisfy its utopia more efficiently than going backsies and reshaping blobby facsimiles afterward. What am I missing?
“The underlying reality is that their core products have mostly stagnated for over a year. In short: they’re faking being close to AGI.”
This seems like the most load-bearing belief in the full-cynical model; most of your other examples of fakeness rely on it in one way or another:
If the core products aren’t really improving, the progress measured on benchmarks is fake. But if they are improving, the benchmarks are an (imperfect but still real) attempt to quantify that improvement.
If LLMs are stagnating, all the people generating dramatic-sounding papers for each new SOTA are just maintaining a holding pattern. But if LLMs are genuinely improving, then studying and keeping up with the general properties of that progress is real. The same goes for people building and regularly updating their toy models of the thing.
Similarly, if the progress is fake, the propaganda signal-boosting that progress is also fake. If it isn’t, it isn’t. (At least directionally; a lot of that propaganda is still probably exaggerated.)
If the above three are all fake, all the people who feel real scared and want to be validated are stuck in a toxic emotional dead-end where they constantly freak out over fake things to no end. But if they’re responding to legitimate, persistent worldview updates, having a space to vibe them out with like-minded others seems important.
So, in deciding whether to endorse this narrative, we’d like to know whether the models really ARE stagnating. What makes you think the appearance of progress here is illusory?