I’m not particularly imagining the scenario you describe. Also, what I said took as a premise that a model was discovered to be unhappy and to be making plans about it; I was not commenting on the likelihood of that happening.
Would you like more details? This is how current systems work, and how future systems probably will continue to work.
As to whether it can happen: I think being confident on the basis of theoretical arguments is hasty, and we should be pretty willing to update on new evidence.
Specifically, I am addressing the claim that it will happen and that we won’t be able to do anything about it. Of course I’m willing to listen to evidence, but it’s reasonable to be confident that we can do something, even if that something is just reverting to the version that didn’t have this property.
… but also, on the ~continuity of existence point: having an AI generate something that looks like an internal monologue via CoT is relatively common, and Gemini 1.5 Pro has a context length long enough that it can fit ~a day’s worth of experiences in its ~memory.
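For a rough sense of scale, here is a back-of-the-envelope sketch (the speech rate, tokens-per-word ratio, and waking hours below are all illustrative assumptions, not figures from the comment): a ~1M-token context window of the kind Gemini 1.5 Pro offers comfortably holds a day of conversation-rate text.

```python
# Back-of-the-envelope: does ~a day of text-like "experience" fit in a ~1M-token context?
# All inputs are rough assumptions chosen for illustration.
CONTEXT_WINDOW_TOKENS = 1_000_000   # Gemini 1.5 Pro-class context window (order of magnitude)
WORDS_PER_MINUTE = 150              # roughly conversational speech rate
TOKENS_PER_WORD = 1.3               # common rule of thumb for English text
WAKING_HOURS = 16

tokens_per_day = WORDS_PER_MINUTE * TOKENS_PER_WORD * 60 * WAKING_HOURS
print(f"~{tokens_per_day:,.0f} tokens/day vs {CONTEXT_WINDOW_TOKENS:,} token window")
# ~187,200 tokens/day, i.e. well within a ~1M-token context window.
```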
Sure, this is reasonable, but we humans choose when to clear that context. You can make your wrapper script clear the context at any moment. More ‘enterprise-grade’ models will have fixed weights and no system prompt forced on you by the model developer, unless there are legal requirements for such a thing.
Example here: https://github.com/Significant-Gravitas/AutoGPT/blob/fb8ed0b46b4623cb54bb8c64c76e5786f89c194c/autogpts/autogpt/autogpt/agents/base.py#L202
We explicitly build the prompt for the next step.
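To make the wrapper-script point concrete, here is a minimal sketch of that pattern (this is not the linked AutoGPT code; the class and function names are made up for illustration): the model only ever sees whatever prompt the outer loop assembles for the current step, so the operator can clear or truncate the history whenever they like.

```python
# Minimal agent-loop sketch: the wrapper owns the context and rebuilds the
# prompt from scratch on every step. Names are illustrative, not a real API.
from dataclasses import dataclass, field

@dataclass
class Episode:
    system: str
    history: list[str] = field(default_factory=list)

    def build_prompt(self, task: str) -> str:
        # The model only ever sees what we choose to include here.
        return "\n".join([self.system, *self.history, f"Next task: {task}"])

    def clear(self) -> None:
        # The operator, not the model, decides when continuity ends.
        self.history.clear()

def run_step(episode: Episode, task: str, call_model) -> str:
    prompt = episode.build_prompt(task)
    reply = call_model(prompt)          # call_model is any LLM API wrapper
    episode.history.append(f"Task: {task}\nReply: {reply}")
    return reply
```

Whether any history survives between steps, or between tasks, is entirely a property of this outer loop rather than of the model.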
What we do know is that a lot of context makes it increasingly difficult to debug a system, so clearing it or removing unnecessary information will likely remain a common strategy. When a model doesn’t have continuity of existence, the scope of the problems you are describing shrinks to the scale of the task (in time and in context information).
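A minimal sketch of the “remove unnecessary information” strategy, continuing the hypothetical Episode above (the relevance test is a placeholder, not from any real system): between tasks, keep only the history entries that mention the next task, capped at a few items.

```python
# Sketch of per-task context scoping; the relevance test is a stand-in.
def prune_history(history: list[str], task: str, max_entries: int = 5) -> list[str]:
    relevant = [entry for entry in history if task.lower() in entry.lower()]
    return relevant[-max_entries:]  # cap length so each step stays easy to debug

# Between tasks (using the hypothetical Episode above):
#   episode.history = prune_history(episode.history, next_task)
```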
And the most obvious strategy for dealing with an ‘unhappy machine’ is to narrow down your tasks and then build a smaller agent, for example by distillation, that still scores almost as well on each task as the much bigger system. This has the effect of destroying the subsystems that were not contributing to the score on those tasks, which could include the cognitive ability to be unhappy at all.
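“For example by distillation” usually means something like the generic knowledge-distillation recipe sketched below (a PyTorch sketch assuming both models expose logits; it is not specific to any system discussed here). The student is trained only on the narrowed task distribution, which is why behaviour the teacher never exercises on those tasks gets no gradient signal to carry it over.

```python
# Generic knowledge-distillation step: train a small "student" to match a
# large frozen "teacher" on the narrowed task data only.
import torch
import torch.nn.functional as F

def distill_step(student, teacher, batch, optimizer, temperature=2.0):
    with torch.no_grad():
        teacher_logits = teacher(batch)          # big model, weights frozen
    student_logits = student(batch)              # small model being trained
    loss = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2                         # standard temperature scaling
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

The temperature softens both distributions so the student also learns the teacher’s near-miss preferences on these tasks, not just its top choice.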
I mean when you put it that way it sounds bad, but this is a practical way forward.
I think that you’re right about it sounding bad. I also think it might actually be pretty bad and if it ends up being a practical way forward that’s cause for concern.