Some thoughts:
Necessary conditions aren’t sufficient conditions. Lists of necessary conditions can leave out the hard parts of the problem.
The hard part of the problem is getting a system to robustly behave according to some desirable pattern (not simply having it know and correctly interpret some specification of that pattern).
I don’t see any reason to think that prompting would achieve this robustly.
As an attempt at a robust solution, without some other strong guarantee of safety, this is indeed a terrible idea.
I note that I don’t expect trying it empirically to produce catastrophe in the immediate term (though I can’t rule it out).
I also don’t expect it to produce useful understanding of what would give a robust generalization guarantee.
With a lot of effort we might achieve [we no longer notice any problems]. This is not a generalization guarantee; it is merely an outcome I consider plausible after putting huge effort into eliminating all noticeable problems.
The “capabilities are very important [for safety]” point seems misleading:
Capabilities create the severe risks in the first place.
We can’t create a safe AGI without advanced capabilities, but we may be able to understand how to make an AGI safe without advanced capabilities.
There’s no “...so it makes sense that we’re working on capabilities” corollary here.
The correct global action would be to spend a few decades gaining theoretical understanding before pushing the cutting edge on capabilities. (Clearly this requires non-trivial coordination!)