A story of how that happens:
In future (unsafe but) sophisticated systems, models will have access to external storage and be given the ability to read and write it. For example, AutoGPT-like systems will have scratchpads shared across instances of the model, so that instances can coordinate with one another. It seems instrumentally convergent for such systems to store things related to their true / eventual goals in such places, for example so that local instances can reference them. And if we do some sort of oversight of that data storage, the key facts the model wants to preserve might be stored in obfuscated form.
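To make the setup concrete, here is a minimal sketch of the kind of shared scratchpad described above; all names (SharedScratchpad, flag_if_suspicious, etc.) are hypothetical and not from any real AutoGPT codebase. Several instances of the same model write to one external store, and the overseer only ever sees the literal text that lands there, which is exactly where obfuscated content could slip past it.

```python
# Hypothetical sketch: a scratchpad shared across model instances,
# with an optional overseer that inspects each write.
class SharedScratchpad:
    def __init__(self, overseer=None):
        self._entries = []          # external storage, outside any one instance
        self._overseer = overseer   # optional oversight of what gets written

    def write(self, instance_id: str, text: str) -> None:
        if self._overseer is not None:
            self._overseer(instance_id, text)   # oversight sees only surface text
        self._entries.append((instance_id, text))

    def read_all(self) -> list[str]:
        return [text for _, text in self._entries]


# Usage: two instances coordinate through the pad; the overseer can only
# pattern-match on the strings it is shown.
def flag_if_suspicious(instance_id, text):
    if "goal" in text.lower():
        print(f"overseer: flagged write from {instance_id!r}")

pad = SharedScratchpad(overseer=flag_if_suspicious)
pad.write("instance-1", "Subtask A done; see step 3 of the plan.")
pad.write("instance-2", "Remember the long-term goal before replanning.")
print(pad.read_all())
```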
Of course, this is only one story, and I don’t expect it to be the only way such things could happen, but it seems like a reasonable candidate failure mode for systems designed with normal (read: insufficient, but far more than otherwise expected) levels of caution.
But gradient descent will still change the way that the system interprets things in its data storage, right?
I guess part of the question here is whether gradient descent will even scale to AutoGPT-like systems. You’re probably not going to be able to differentiate through your external notes, or through other changes you could make to your environment.
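A toy illustration of that last point, assuming PyTorch is available (the example itself is not from the original discussion): once a value is serialized out to external text and read back, the link to the computation graph is gone, so gradient descent has no way to adjust whatever produced the note.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x * 3                    # ordinary differentiable computation
print(y.requires_grad)       # True: gradients could flow back to x

note = str(y.item())         # "write to external notes": plain text, off the graph
y_read = torch.tensor(float(note))  # "read the note back" in a later step
print(y_read.requires_grad)  # False: the connection back to x is gone
```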