Translating to a tree of natural language descriptions and back lets you:
- use different instances of GPT at each step, so that any coded messages would have to rely on the common training set rather than shared internals, and
- do a random subset of the tasks yourself, so that any subversion of the target system would have to be based on subtle biases, perhaps with a small number of outright lies that it hopes we don’t implement manually, rather than the description being fabricated from whole cloth.
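A minimal sketch of the random-subset check, assuming tasks are delegated one at a time (`model_solve` and `human_solve` are hypothetical stand-ins for the delegated and manual versions of a task):

```python
import random

def run_with_audit(tasks, model_solve, human_solve, audit_rate=0.01):
    """Delegate every task to the model, redo a random fraction by hand,
    and flag any task where the two answers disagree."""
    results, flagged = {}, []
    for task in tasks:
        results[task] = model_solve(task)
        if random.random() < audit_rate and human_solve(task) != results[task]:
            flagged.append(task)
    return results, flagged
```

Under these assumptions, a description fabricated from whole cloth gets caught with probability roughly `audit_rate` per lie, which is why the subversion has to retreat to subtle biases.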
Having the thing write papers is merely an existence proof of embedded agency being irrelevant except for deconfusion.
Intelligent agents causally responsible for your existence.
What do you mean you can think of this, I told this to you :D
Maybe you did. I find it hard to distinguish inventing ideas from half-remembering them.
If the training procedure either
- reliably produces mesaoptimizers with about the same values, or
- reliably produces mesaoptimizers that can acausally cooperate,
and the rest of the procedure allows one mesaoptimizer to take control of the whole output, then using different copies of GPT-n trained from different seeds doesn’t help.
If you just convert 1% of the English into network yourself, then all it needs to use is some error correction. Even without that, neural net structures are somewhat redundant: a malicious optimizer with a random 1% removed is probably still a malicious optimizer.
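To see why a random 1% deletion buys little, here is a toy repetition code in Python (the 5x duplication factor is an arbitrary choice, standing in for whatever incidental redundancy the network has): erasing 1% of positions at random almost never destroys every copy of any bit, so a majority vote over the survivors recovers the original exactly.

```python
import random

def encode(bits, copies=5):
    # Repetition code: write each bit `copies` times.
    return [b for b in bits for _ in range(copies)]

def erase(coded, fraction, rng):
    # Replace a random `fraction` of positions with None (erased).
    coded = list(coded)
    for i in rng.sample(range(len(coded)), int(len(coded) * fraction)):
        coded[i] = None
    return coded

def decode(coded, copies=5):
    # Majority vote over the surviving copies of each bit.
    out = []
    for i in range(0, len(coded), copies):
        survivors = [b for b in coded[i:i + copies] if b is not None]
        out.append(1 if sum(survivors) * 2 > len(survivors) else 0)
    return out
```

The deliberate copies here play the role that incidental redundancy plays in a trained network: deleting a small random fraction leaves the encoded content intact.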
Secondly, its goal here is to give us subtly wrong alignment papers. Suppose we produce an examination and transparency step so thorough that no mesaoptimizer can slip through. The original mesaoptimizer can just design a network that does its best as a mesaoptimizer-free predictor, but has subtly, maliciously chosen beliefs about AI research.
The error correction needs to be present in the original network because I also do some of the converting of the network into English. The only reason I don’t do everything myself is that it takes too long. The proportion can be higher at the topmost levels because there are fewer tasks there. The error correction doesn’t let it completely ignore what I would do at the low levels, because on the 1% I can compare its outputs to mine, so they need to at least superficially look similar.
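On that audited 1%, the comparison only needs to be superficial. A crude stand-in for "looks similar enough" is a character-level similarity ratio from the standard library (the 0.8 threshold is an arbitrary assumption):

```python
import difflib

def superficially_similar(mine, its, threshold=0.8):
    # SequenceMatcher.ratio() gives a similarity score in [0, 1]; crude,
    # but enough to catch descriptions that differ wholesale rather than subtly.
    return difflib.SequenceMatcher(None, mine, its).ratio() >= threshold
```

This would pass subtly biased descriptions, which is exactly the failure mode being discussed, but it forces the outputs to at least track what I would have written.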
If we find that there’s a bunch of redundancy, we can check whether there is any way to cut it down that would suddenly reveal a bunch of mesaoptimization. So the mesaoptimizer would have to take over the network entirely, or trick its brethren as well as us.