Would you think that the following approach fits within “in addition to making alignment your top priority and working really hard to over-engineer your system for safety, also build the system to have the bare minimum of capabilities” and could possibly work, or would it be hopelessly doomed?
1. Work hard on designing the system to be safe.
2. But there’s some problem left over that you haven’t been able to fully solve, and that you think will manifest at a certain scale (level of intelligence, optimization power, or capabilities).
3. Run the system, but limit scale to stay well within the range where you expect it to behave well.
I think you’re probably in a really bad state if you have to lean very much on that with your first AGI system. You want to build the system to not optimize any harder than absolutely necessary, but you also want the system to fail safely if it does optimize a lot harder than you were expecting.
The kind of AGI approach that seems qualitatively like “oh, this could actually work” to me involves more “the system won’t even try to run searches for solutions to problems you don’t want solved” and less “the system tries to find those solutions but fails because of roadblocks you put in the way (e.g., you didn’t give it enough hardware)”.
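To make the contrast concrete, here is a toy sketch (purely illustrative and not anything proposed in the discussion; the names `capped_search` and `scoped_search` and the whole setup are hypothetical). The first search is bounded only by an external resource cap, while the second is bounded because excluded candidates are never generated for it to consider.

```python
from typing import Callable, Iterable, Optional, TypeVar

T = TypeVar("T")


def capped_search(candidates: Iterable[T],
                  score: Callable[[T], float],
                  max_steps: int) -> Optional[T]:
    """'Roadblock' style: the search would keep optimizing if allowed;
    it only stops because an external resource cap (max_steps) cuts it off."""
    best: Optional[T] = None
    for step, cand in enumerate(candidates):
        if step >= max_steps:  # externally imposed limit on optimization power
            break
        if best is None or score(cand) > score(best):
            best = cand
    return best


def scoped_search(candidates_in_scope: Iterable[T],
                  score: Callable[[T], float]) -> Optional[T]:
    """'Doesn't even try' style: out-of-scope solutions are never proposed
    in the first place, so extra compute never makes the search consider them."""
    best: Optional[T] = None
    for cand in candidates_in_scope:
        if best is None or score(cand) > score(best):
            best = cand
    return best


if __name__ == "__main__":
    nums = range(50)
    # The capped search stops after 10 candidates, so it never sees most of them.
    print(capped_search(nums, lambda n: n * n, max_steps=10))   # -> 9
    # The scoped search only ever receives candidates inside the intended scope (evens).
    print(scoped_search((n for n in nums if n % 2 == 0),
                        lambda n: n * n))                       # -> 48
```

The point of the contrast, in this toy framing, is that `capped_search` misbehaves again the moment someone raises `max_steps`, whereas `scoped_search`’s restriction does not depend on how much compute it is given.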