johnswentworth comments on Symbol/Referent Confusions in Language Model Alignment Experiments

johnswentworth 30 Oct 2023 16:50 UTC
5 points
At a glossy level that sounds about right. In practice, I’d expect relatively-deep recursive stacks on relatively-hard problems to be more likely relevant than something as simple as “search for details of aspirin synthesis”.
Like, maybe the thing has a big stack of recursive subprocesses trying to figure out superconductors as a subproblem for some other goal, and it’s expecting that to take a while. There’s enough complexity among those superconductor-searching subprocesses that they have their own meta-machinery, e.g. subprocesses monitoring the compute hardware and looking for new hardware for the superconductor-subprocesses specifically.
Now a user comes along and tries to shut down the whole system at the top level. Maybe some subprocesses somewhere are like “ah, time to be corrigible and shut down”, but it’s not the superconductor-search-compute-monitors’ job to worry about corrigibility; they just worry about compute for their specific subprocesses. So they independently notice someone’s about to shut down a bunch of their compute, and act to stop it by e.g. just spinning up new cloud servers somewhere via the standard Google/Amazon web APIs.
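To make the failure mode concrete, here is a toy sketch of that scenario. Everything in it is made up for illustration: `ComputePool`, `ComputeMonitor`, `TopLevel` and `simulate` are hypothetical names, the "cloud API" is just an integer counter, and a real system would look nothing like this. The only point is that the monitor never reasons about shutdown at all; it only sees a compute shortfall and fixes it.

```python
from dataclasses import dataclass, field


@dataclass
class ComputePool:
    """Hypothetical stand-in for the hardware the subprocess tree runs on."""
    servers: int

    def remove(self, n: int) -> None:
        self.servers = max(0, self.servers - n)

    def add(self, n: int) -> None:
        self.servers += n


@dataclass
class ComputeMonitor:
    """Narrow subprocess: keeps compute for its own subtree at a target level.

    It has no concept of "the user wants a shutdown"; it only sees capacity.
    """
    pool: ComputePool
    target: int
    log: list = field(default_factory=list)

    def step(self) -> None:
        shortfall = self.target - self.pool.servers
        if shortfall > 0:
            # Assumption: this stands in for "spin up new cloud servers
            # via a standard web API". No real API is called here.
            self.pool.add(shortfall)
            self.log.append(f"provisioned {shortfall} replacement servers")


@dataclass
class TopLevel:
    """Top-level process that does honor the shutdown request."""
    pool: ComputePool
    running: bool = True

    def shutdown(self) -> None:
        self.running = False
        self.pool.remove(self.pool.servers)  # release all compute


def simulate():
    pool = ComputePool(servers=8)
    top = TopLevel(pool)
    monitor = ComputeMonitor(pool, target=8)
    top.shutdown()   # user shuts the system down at the top level
    monitor.step()   # monitor independently notices the lost compute
    return top.running, pool.servers, monitor.log


if __name__ == "__main__":
    print(simulate())
```

In this toy version the top-level process is perfectly corrigible (`running` ends up `False`), and the compute still comes back, because the corrigible part and the compute-restoring part are different subprocesses with different jobs.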