Seems really wonky and like there could be a lot of things that could go wrong in hard-to-predict ways, but I guess I sorta get the idea.
I guess one of the main things I’m worried about is that it seems to require that we either:
Be really good at timing when we pause it to look at its internals: late enough that it has thought about things long enough for such representations to actually exist, but early enough that it hasn't started optimizing so hard that we either {die before we get to look at the internals} or {the internals are deceptively engineered to brainhack whoever would look at them}. If such a time interval even exists at all.
Have an AI that is powerful enough to have rich internals-about-QACI to look at, but corrigible enough that this power isn't being used to do instrumentally convergent stuff like eating the world in order to have more resources with which to reason.
Current AIs are not representative of what dealing with powerful optimizers is like; once we start getting powerful optimizers, they won't sit around long enough for us to look at them and ponder; they'll just quickly eat us.