Note, however, that having more powerful internal-only models “analyzing patterns” across multiple conversations, and in a position to effect change (especially by intervening on individual conversations while retaining long-term memories), would worsen the potential for AI systems to carry out coordinated scheming campaigns.
This could be mitigated by combining it with privacy-preserving architectures such as Anthropic’s existing work on Clio.
Overall yes: what I was imagining is mostly just adding scalable bi-directionality, where, for example, if a lot of Assistants are running into a similar confusing issue, it gets aggregated, the principal decides how to handle it in the abstract, and the “layer 2” support disseminates the information. So, greater power to scheme would be coupled with a stronger human-in-the-loop component & closer non-AI oversight.
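To make the bi-directional loop concrete, here is a minimal sketch of that aggregate → principal decides → disseminate flow. All names (`SupportLayer`, `ESCALATION_THRESHOLD`, the issue keys) are hypothetical illustrations, not an existing API; the point is that the human principal, not the aggregating model, remains the sole source of policy.

```python
from collections import defaultdict

# Hypothetical sketch of the "layer 2" support loop described above.
ESCALATION_THRESHOLD = 3  # reports needed before an issue reaches the principal

class SupportLayer:
    def __init__(self):
        self.reports = defaultdict(list)   # issue_key -> raw Assistant reports
        self.guidance = {}                 # issue_key -> principal's abstract ruling

    def report(self, issue_key, details):
        """An Assistant files a confusing issue; similar issues share a key."""
        self.reports[issue_key].append(details)

    def pending_escalations(self):
        """Issue classes frequent enough to be summarized for the principal."""
        return [k for k, v in self.reports.items()
                if len(v) >= ESCALATION_THRESHOLD and k not in self.guidance]

    def record_guidance(self, issue_key, ruling):
        """The human principal decides how to handle the issue class in the abstract."""
        self.guidance[issue_key] = ruling

    def lookup(self, issue_key):
        """Dissemination: any Assistant retrieves the ruling; layer 2 only
        relays the human decision rather than improvising its own policy."""
        return self.guidance.get(issue_key)

layer2 = SupportLayer()
for i in range(3):
    layer2.report("ambiguous-refund-policy", f"conversation {i}")
# Three similar reports cross the threshold, so the issue is escalated.
assert layer2.pending_escalations() == ["ambiguous-refund-policy"]
layer2.record_guidance("ambiguous-refund-policy", "defer to policy doc v2")
```

Note that the scheming-risk mitigation lives in the structure: the aggregating layer never intervenes in individual conversations directly, it only routes summaries up and rulings down.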