Just from a high level, I think the general style of approaches to AI safety that imagine human researchers solving hard technical challenges in order to make ASI safe enough to deploy widely is fundamentally flawed.
The most likely way to get to extremely safe AGI or ASI systems is not by humans creating them; it's by other, less-safe AGI systems creating them.
“Technological advance is an inherently iterative process. One does not simply take sand from the beach and produce a Dataprobe. We use crude tools to fashion better tools, and then our better tools to fashion more precise tools, and so on.”
For this reason I consider approaches that try to make provably safe ASI to be mistaken. We should instead just aim to make AGI that's very capable and steer its use toward the next generation of safer systems, and that steering likely won't involve proofs.
This does seem more likely, but managing to sidestep the less-safe AGI part would be safer. In particular, it might be possible to construct a safe AGI by using safe-if-wielded-responsibly tool AIs (that are not AGIs), if humanity takes enough time to figure out how to actually do that.
The current paradigm of AI research makes it hard to make really pure tool AIs.
We have software tools, like Wolfram Alpha, and we have LLM-derived systems.
This is probably the set of tools we will either win or die with.
I don’t see anything in here that forbids using weaker AI systems to help with this plan. But how do you ever know when you’ve succeeded if it’s not by proofs? Proving things about AIs is not out of the question; it’s already done for some kinds of safety-critical deep learning AIs today.
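To make that last claim concrete, here is a minimal sketch (my own illustration, not a reference to anyone's actual verification pipeline) of one technique in that family: interval bound propagation, which certifies that a small ReLU network's output stays inside a range for every input in a given box. The toy network, sizes, and numbers are all invented for the example.

```python
# Minimal sketch of interval bound propagation (IBP) for a tiny ReLU MLP.
# The network and the "safe" threshold below are hypothetical.
import numpy as np

def interval_bounds(weights, biases, x_lo, x_hi):
    """Propagate an input box [x_lo, x_hi] through a ReLU MLP and return
    sound (possibly loose) lower/upper bounds on every output coordinate."""
    lo, hi = np.asarray(x_lo, dtype=float), np.asarray(x_hi, dtype=float)
    for i, (W, b) in enumerate(zip(weights, biases)):
        W_pos, W_neg = np.clip(W, 0, None), np.clip(W, None, 0)
        new_lo = W_pos @ lo + W_neg @ hi + b   # smallest achievable pre-activation
        new_hi = W_pos @ hi + W_neg @ lo + b   # largest achievable pre-activation
        if i < len(weights) - 1:               # ReLU on hidden layers only
            new_lo, new_hi = np.maximum(new_lo, 0), np.maximum(new_hi, 0)
        lo, hi = new_lo, new_hi
    return lo, hi

# Toy 2-3-1 network; real certified controllers use the same idea at larger scale.
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(3, 2)), rng.normal(size=(1, 3))]
bs = [np.zeros(3), np.zeros(1)]
out_lo, out_hi = interval_bounds(Ws, bs, x_lo=[-0.1, -0.1], x_hi=[0.1, 0.1])
print("certified output range:", out_lo, out_hi)
# If "safe" means output < 1.0, then out_hi < 1.0 is a machine-checkable
# certificate that holds for *every* input in the box, not just tested points.
```

The point is only that "knowing you've succeeded" can sometimes be a checkable certificate rather than a vibe, at least for narrow properties of narrow systems.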
I think the first step will be using AGIs to come up with better plans.
My general concern with “using AGI to come up with better plans” is not that they will successfully manipulate us into misalignment, but that they will Goodhart us into reinforcing stereotypes of “how a good plan should look”, or something along that dimension, purely because of how RLHF-style steering works.
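A toy sketch of that worry (my own illustration; the "stylishness" proxy and all numbers are invented): when the reward signal partly measures how a plan looks rather than how good it is, selecting hard against it drifts toward style over substance.

```python
# Toy Goodhart example: optimizing a proxy reward that mixes substance with style.
import numpy as np

rng = np.random.default_rng(1)

def true_quality(plan):      # what we actually care about (not directly observed)
    return plan["substance"]

def proxy_reward(plan):      # RLHF-style learned reward: substance plus a style bias
    return plan["substance"] + 2.0 * plan["stylishness"]

# Candidate "plans": substance and stylishness drawn independently.
plans = [{"substance": rng.normal(), "stylishness": rng.normal()}
         for _ in range(10_000)]

best_by_proxy = max(plans, key=proxy_reward)
best_by_truth = max(plans, key=true_quality)

print("true quality of proxy-selected plan:", true_quality(best_by_proxy))
print("true quality of actually best plan :", true_quality(best_by_truth))
# Under heavy selection pressure, the proxy-selected plan is chosen mostly for
# stylishness, and its true quality falls well short of the genuinely best plan.
```

Nothing here requires manipulation or deception; mere selection on an imperfect signal is enough to produce the drift.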
Humans already do this, except we have made it politically incorrect to talk about the ways in which human-generated Goodharting makes the world worse (race, gender, politics, etc.).
Your examples are at least clearly visible. If your wrong alignment paradigm gets reinforced because of your attachment to a specific model of causality known to only ten people in the entire world, you risk noticing it too late.
You’re thinking about this the wrong way. AGI governance will not operate like human governance.
Can you elaborate? I don’t understand where we disagree.