The problem is that our security depends on the companies implementing the weakest measures, since they are the ones that can produce a rogue AI that will “go wild” because of the lack of those measures.
The best-case scenario is strong physical and network separation between the lab that works on weights and training (and hosts the inference server) and a separate lab working on scaffolding and execution. This is somewhat similar to the situation where researchers build automated AI agents but use another company's AI for inference: there is only a slim chance the agent would be able to hack into Google and steal its own model weights.
But I think it is safe to assume that in some AI labs things will be done in the easiest, most performant way, especially in labs with smaller budgets and high pressure to make progress.
This means we will see setups where scaffolding and execution run on a single server, with either no separation at all or simple sandboxing in Docker (which may or may not be enough, depending on the network configuration and on what file access is granted through volume mounts; see the sketch below). The weights will probably still be kept separate, for the reasons stated by the author.
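To make “may or may not be enough” concrete, here is a minimal sketch (my own illustration, not anything described in the post) of what a reasonably locked-down Docker invocation for the execution step could look like. The image name and mount paths are hypothetical; the point is that relaxing the network mode or the volume mounts is all it takes to turn this from a sandbox into a formality.

```python
import subprocess

def run_step_sandboxed(command: list[str]) -> subprocess.CompletedProcess:
    """Run one scaffolding-issued command inside a locked-down container.

    Hypothetical example: the image and host paths are placeholders.
    """
    docker_cmd = [
        "docker", "run", "--rm",
        "--network", "none",             # no network: the model cannot call out
        "--read-only",                   # root filesystem is immutable
        "--cap-drop", "ALL",             # drop all Linux capabilities
        "--security-opt", "no-new-privileges",
        "--pids-limit", "256",
        "--memory", "2g",
        # A single read-only mount for task inputs. Mounting the scaffolding
        # source, credentials, or a writable host directory here is exactly
        # the kind of shortcut that quietly defeats the sandbox.
        "-v", "/srv/task-inputs:/workspace:ro",
        "python:3.11-slim",              # hypothetical base image
        *command,
    ]
    return subprocess.run(docker_cmd, capture_output=True, text=True, timeout=300)
```

Drop `--network none` or add a writable mount of a sensitive host path, and the same container becomes a usable beachhead on the lab's network rather than a sandbox.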
I see some additional attack vectors for a rogue AI:
It is possible that on the same network there is a workstation where scientists refine the model and update weights. It might be less secure than the inference server and easier to break into, if found.
Often the most vulnerable part of the system is the personnel. If workers can talk with the AI through a scaffolding routine, it can make suggestions. I can imagine a situation where a security-related task falls to someone senior in data science but not in systems security (because someone left the company, because of budget and time constraints, etc.). That person then has the “genius” idea of consulting the AI about the task and gets misled.
If it records its own prompts and gets access to the scaffolding code, it can recreate that piece somewhere else. The key element of the system's AGI-level abilities might be the scaffolding architecture, not the weights; possibly it could work elsewhere on top of a different set of weights (e.g. an open-source model). That means it only needs to exfiltrate its scaffolding code and prompts, not the weights. It could then refine those other weights on its own, if able (a sketch of why the scaffolding is so portable follows below).
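A minimal sketch, under my own assumptions, of why scaffolding travels so easily: the agent loop only depends on a generic text-completion callable, so the same prompts and control flow can be pointed at any backend, whether the original lab model or an open-source model served locally. All names here are hypothetical.

```python
from typing import Callable, List

CompleteFn = Callable[[str], str]  # prompt in, model text out

def agent_loop(complete: CompleteFn, system_prompt: str, task: str,
               max_steps: int = 10) -> List[str]:
    """Run a simple plan-act loop; nothing here is tied to specific weights."""
    transcript: List[str] = []
    context = f"{system_prompt}\n\nTask: {task}"
    for _ in range(max_steps):
        reply = complete(context)
        transcript.append(reply)
        if "DONE" in reply:            # toy stopping condition
            break
        # Feed the model's own output back in, as scaffolding typically does.
        context = f"{context}\n\nPrevious step:\n{reply}\n\nNext step:"
    return transcript

# The same loop runs unchanged whichever backend `complete` wraps:
#   agent_loop(lab_model_api, recorded_system_prompt, "continue the plan")
#   agent_loop(local_open_source_model, recorded_system_prompt, "continue the plan")
```

The scaffolding code plus the recorded prompts is a small text payload, which is exactly why exfiltrating it is so much easier than exfiltrating weights.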