Your “given a lab environment where we strongly expect an AI to be roughly human level” seems to assume the thing to be proven.
Suppose a lab is trying out a novel AI architecture or even just a significant algorithmic innovation. You can inspect the source code for the training algorithm, can inspect the trillions of weights that result from the training, and can ask questions of the designers. How do you estimate its capabilities without giving the AI a lot of access to things it can use to take over (e.g., the ability to write code and have it run, gullible humans)?
If you have an approach I think is promising, I’ll drop everything and lobby funders to give you a big salary to work on your approach.
The problem with the simulated environment is making it accurate enough that the AI cannot tell that it is in a simulated environment. Even if you can fool the AI with 99% probability, the 1% of cases in which the AI is not fooled will tend to be the ones where the AI is very dangerous indeed.
Your “given a lab environment where we strongly expect an AI to be roughly human level” seems to assume the thing to be proven.
But if we are being Bayesians here, then it seems to become clearer. Let’s assume the law says we can safely and effectively evaluate an AGI that’s truly 0-200% as smart as the smartest human alive (revisit this as our alignment tools improve). Now, as you say, in the real world there is dangerous uncertainty about this. So how much probability do we put on, say, GPT-5 exceeding that limit? That’s our risk, and it either fits within our risk budget or it doesn’t. If there were a regulatory body overseeing this, you would need to submit evidence to back up your estimate, and they would hire smart people or AI to make the judgment (the FDA does this now, with mixed results). The applicant can put in work to derisk the prospect.
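To make the risk-budget step concrete, here is a minimal sketch. Everything in it is hypothetical: the lognormal belief distribution, its parameters, the 200% ceiling expressed as 2.0x, and the 1% budget are made-up illustrative numbers, not real estimates for any model.

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical belief about a new model's capability, expressed as a multiple
# of the smartest living human (1.0 = parity). Lognormal is just one choice.
samples = rng.lognormal(mean=np.log(0.6), sigma=0.5, size=1_000_000)

ceiling = 2.0          # the "0-200% of the smartest human" evaluation limit
risk_budget = 0.01     # maximum acceptable P(model exceeds the ceiling)

p_exceed = np.mean(samples > ceiling)
print(f"P(capability > {ceiling:.1f}x smartest human) ~ {p_exceed:.4f}")
print("within budget" if p_exceed <= risk_budget
      else "exceeds budget: do not evaluate at full scale")

The regulator's job in this framing is to audit the distribution you submit, not just the final yes/no.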
With a brand-new architecture, you are right, we shouldn’t scale it up and test the big version; that would be too risky! We evaluate small versions first and use that to establish priors about how capabilities scale. This is generally how architecture improvements work anyway, so it wouldn’t require extra work.
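As a sketch of that extrapolation step, assuming invented benchmark scores and parameter counts (none of these numbers come from a real scaling study):

import numpy as np

# Hypothetical small-scale runs of the new architecture:
params = np.array([1e8, 3e8, 1e9, 3e9])        # model sizes actually trained
scores = np.array([0.22, 0.31, 0.41, 0.52])    # benchmark scores observed

# Fit a simple log-linear trend (score vs. log10 of parameter count),
# a crude stand-in for a proper scaling law.
slope, intercept = np.polyfit(np.log10(params), scores, deg=1)

# Extrapolate to the proposed full-scale run before training it.
target_params = 1e12
predicted = slope * np.log10(target_params) + intercept
print(f"Predicted score at {target_params:.0e} params: {predicted:.2f}")

# This prediction, plus the uncertainty around it, is the prior you would
# submit before being allowed to train and evaluate the large model.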
Does that make sense? To me, this seems like a more practical approach than once-and-done alignment, or pausing everything. And most importantly of all, it seems to set up the incentives correctly: commercial success depends on convincing us you have narrower AI, better alignment tools, or have increased human intelligence.
If you have an approach I think is promising, I’ll drop everything and lobby funders to give you a big salary to work on your approach.
I wish I did, mate ;p Even better than a big salary, it would give me and my kids a bigger chance to survive.