This post seemed to start as a reply to AGI Timelines etc and then morphed into a discussion of safety boxes. I clicked on the link expecting to reply to the former, but I'm actually going to comment on the latter.
Virtual sandboxes for safety evaluation aren’t discussed here that much, but they are probably going to be key for alignment. I was advocating for them here about 7 years ago towards the end of this post.
In some sense they are just a natural evolution of training in game environments à la DeepMind. It's not the whole of alignment of course, but no sane person should accept claims about an agent's safety without a large inspectable dataset of that agent architecture's behavioral data in well-controlled sandbox sims (where deception is avoided because the agent isn't aware of sim containment). It isn't the complete solution, but it should be table stakes. It's not the kind of deep new theoretical insight that most LW-type alignment researchers are interested in, as it's more of a standard engineering approach, but that doesn't make it any less important.
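To make "inspectable dataset of behavioral data" concrete, here's a minimal sketch of the kind of harness I have in mind (Python; the `sim`/`agent` interfaces with `reset`, `step`, and `act` are hypothetical assumptions, not any particular lab's API, and it assumes serializable observations). The point is just that every episode in the sandbox leaves a trace that reviewers can audit independently of the team that trained the agent.

```python
import json
from dataclasses import dataclass, asdict
from typing import Any, List


@dataclass
class Step:
    observation: Any
    action: Any
    reward: float


def run_episode(sim, agent, max_steps: int = 10_000) -> List[Step]:
    """Run one sandboxed episode and record the full behavioral trace."""
    trace: List[Step] = []
    obs = sim.reset()
    for _ in range(max_steps):
        action = agent.act(obs)
        next_obs, reward, done = sim.step(action)
        trace.append(Step(observation=obs, action=action, reward=reward))
        obs = next_obs
        if done:
            break
    return trace


def save_trace(trace: List[Step], path: str) -> None:
    # Persist the trace so safety reviewers can inspect behavior after the fact.
    with open(path, "w") as f:
        json.dump([asdict(s) for s in trace], f)
```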
Yes, not a new idea, certainly not my idea. I’m not even arguing for my personal work or expertise being relevant to it. What I am arguing is that it is important, and we need it ASAP, and until that is at least publicly underway I need to keep pointing out that we need it. It’s time to start turning theoretical ideas into actual working systems.
If you look over your bullet list and ask "is this a capability that groups training proto-AGI in sims (like DeepMind) already have today?" or "are likely to have by the time we reach AGI?", I get a yes for basically all of them.
So to the extent the safe path here differs from business as usual, it seems to be around:
1.) coordinating on standardized sim environments, where AGI teams can still compete over model design while sharing sandbox and safety-evaluation infrastructure (a rough sketch of what that shared interface could look like follows this list),
and 2.) advocating the importance of leakage prevention/data isolation as we approach AGI (isolating the sim environment so that it doesn’t reveal too much about our world).
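Here is the sketch for 1: a common environment/agent interface that any team's model plugs into, so a single standardized safety-evaluation suite can score all of them. The `Protocol` names and the violation-flag convention are purely illustrative assumptions, not an existing standard.

```python
from typing import Any, Protocol, Tuple


class SandboxEnv(Protocol):
    """Interface every shared sandbox sim would implement.

    step() returns (observation, reward, episode_done, safety_violation);
    the violation flag is an illustrative convention, not a real standard.
    """
    def reset(self) -> Any: ...
    def step(self, action: Any) -> Tuple[Any, float, bool, bool]: ...


class AgentUnderTest(Protocol):
    """Interface every team's competing model would implement."""
    def act(self, observation: Any) -> Any: ...


def run_safety_suite(env: SandboxEnv, agent: AgentUnderTest,
                     episodes: int = 100) -> float:
    """Return the fraction of episodes with no flagged safety violation."""
    clean = 0
    for _ in range(episodes):
        obs = env.reset()
        done, violated = False, False
        while not done:
            obs, _reward, done, violation = env.step(agent.act(obs))
            violated = violated or violation
        if not violated:
            clean += 1
    return clean / episodes
```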
For an example of 2: when Carmack was talking about his AGI approach, he pondered whether AGIs will need a full sim with virtual TV screens or whether you can skip the sim and just hook the TV/browser directly to their input. It's clear there that he isn't really considering the danger of letting a potentially dangerous future AGI read the internet.
So long as the AI retains no modifications or artifacts from the evaluation, it can't learn from it, and thus it should be safe to make the simulation as accurate as possible. And I agree that some things like this are already sort of happening, but not with the evaluation focused on the most dangerous paths (agentive strategic planning to acquire resources, self-modification to increase abilities, deceptive manipulation to avoid detection and run cons, etc.), and not in a widespread, consistent way where every published paper has a little note saying that the model passed its standardized safety evals.
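A minimal sketch of the "retains no modifications or artifacts" condition, assuming a hypothetical agent object that can be deep-copied and frozen (`freeze()` is an assumed method, not a real API): the evaluation runs on a detached copy that is discarded afterwards, so nothing observed inside the sim flows back into the deployed or training weights.

```python
import copy


def evaluate_without_leakage(agent, sim, run_suite):
    """Run `run_suite(sim, eval_agent)` on a throwaway copy of the agent."""
    eval_agent = copy.deepcopy(agent)  # detached copy; the original is untouched
    eval_agent.freeze()                # assumed method: disable weight updates
    try:
        return run_suite(sim, eval_agent)
    finally:
        # Discard the copy so no trace of the evaluation persists in the agent.
        del eval_agent
```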
If the AI learns it is in a sim, that could completely undermine or invalidate any evaluation of its ethical/moral/altruistic behavior. I am assuming that the agent's entire life and education/training process is thus an evaluation. The sim can be 'accurate', it just needs to be knowledge constrained. A medieval tech era sim would be fine, for example.
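As one illustration of "accurate but knowledge constrained" (all field names here are hypothetical), the sim configuration would cap the in-sim technology level and corpus so that nothing reveals our actual world or the fact of containment:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SimKnowledgeConfig:
    tech_era: str = "medieval"             # cap on in-sim technology level
    corpus_cutoff_year: int = 1500         # exclude texts carrying later knowledge
    reveal_sim_containment: bool = False   # the agent is never told it is simulated
    blocked_topics: List[str] = field(default_factory=lambda: [
        "computing", "machine_learning", "modern_geography",
    ])
```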