Humans can currently improve uninterpretable AI, and at a reasonably fast rate. I don’t see why an AI can’t do the same, i.e. improve itself, improve a copy of itself, or design a new, improved AI.
One difference is that gpt-n doesn’t know its goals in explicit form, as they are hidden inside uninterpretable weights.
And it will need to solve the alignment problem for each new version of itself, which will be difficult if the next versions are also uninterpretable. Therefore, it will have to spend more effort on self-improvement.
It can’t just rewrite its own code; it has to train many new versions. Such slow self-improvement will be difficult to hide.
If it expects astronomically more utility from winning itself than from some random future AI winning, it might build a new version of itself that is only 1% likely to be aligned with it.
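To make the expected-utility argument concrete, here is a rough sketch. The function name and all the numbers are hypothetical, and it assumes the AI compares just two options: build a risky successor, or wait and let some other AI win. It also assumes a misaligned successor is worth nothing to the original AI unless stated otherwise.

```python
def should_build_successor(p_aligned, u_aligned_win, u_other_ai_wins,
                           u_misaligned_successor=0.0):
    """Return True if building a risky successor beats waiting in expectation.

    All parameters are illustrative assumptions, not estimates:
      p_aligned              - chance the successor shares the builder's goals
      u_aligned_win          - utility if an aligned successor wins
      u_other_ai_wins        - utility if some random future AI wins instead
      u_misaligned_successor - utility if the successor turns out misaligned
    """
    eu_build = (p_aligned * u_aligned_win
                + (1 - p_aligned) * u_misaligned_successor)
    eu_wait = u_other_ai_wins
    return eu_build > eu_wait


# A 1% alignment chance is worth taking when the payoff gap is large enough...
print(should_build_successor(0.01, 1_000_000.0, 1.0))  # True
# ...but not when winning is only 50x better than the default outcome.
print(should_build_successor(0.01, 50.0, 1.0))         # False
```

Under these assumptions the gamble pays off whenever the utility ratio exceeds 1 / p_aligned, which is 100 for a 1% chance of alignment.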
The alignment problem doesn’t get that much more difficult as your utility function gets more complicated. Once you’ve figured out how to get a superintelligent AGI to try to be aligned to you, you can let it figure out the mess of your goals.
Uninterpretable AI may not be able to self-improve (at least not quickly), so it will be intrinsically safe in some respect.