1a3orn comments on Symbol/Referent Confusions in Language Model Alignment Experiments

1a3orn 30 Oct 2023 15:04 UTC
13 points
2
So—to make this concrete—something like ChemCrow is trying to make asprin.

Part of the master planner for ChemCrow spins up a google websearch subprocess to find details of the asprin creation process. But then the Google websearch subprocess—or some other part—is like “oh no, I’m going to be shut down after I search for asprin,” or is like “I haven’t found enough asprin-creation processes yet, I need infinite asprin-creation processes” or just borks itself in some unspecified way—and something like this means that it starts to do things that “won’t play well with shutdown.”

Concretely, at this point, the Google websearch subprocess does some kind of prompt injection on the master planner / refuses to relinquish control of the thread, which has been constructed as blocking by the programmer / forms an alliance with some other subprocess / [some exploit], and through this the websearch subprocess gets control over the entire system. Then the websearch subprocess takes actions to resist shutdown of the entire thing, leading to non-corrigibility.

This is the kind of scenario you have in mind? If not, what kind of AutoGPT process did you have in mind?
- johnswentworth 30 Oct 2023 16:50 UTC
  5 points
  0
  Parent
  At a glossy level that sounds about right. In practice, I’d expect relatively-deep recursive stacks on relatively-hard problems to be more likely relevant than something as simple as “search for details of aspirin synthesis”.
  Like, maybe the thing has a big stack of recursive subprocesses trying to figure out superconductors as a subproblem for some other goal and it’s expecting to take a while. There’s enough complexity among those superconductor-searching subprocesses that they have their own meta-machinery, like e.g. subprocesses monitoring the compute hardware and looking for new hardware for the superconductor-subprocesses specifically.
  Now a user comes along and tries to shut down the whole system at the top level. Maybe some subprocesses somewhere are like “ah, time to be corrigible and shut down”, but it’s not the superconductor-search-compute-monitors’ job to worry about corrigibility, they just worry about compute for their specific subprocesses. So they independently notice someone’s about to shut down a bunch of their compute, and act to stop it by e.g. just spinning up new cloud servers somewhere via the standard google/amazon web APIs.

1a3orn comments on Symbol/​Referent Confusions in Language Model Alignment Experiments

1a3orn comments on Symbol/Referent Confusions in Language Model Alignment Experiments