Bearing in mind that most of my thinking on AI is more in the vein of cooking up science fiction plots …
The idea I was trying to evoke is that if you get a “fail” response from the system, you completely wipe the memory and never look at what happened. If you get a “pass,” you stop everything and go in layer by layer, checking that none of the gatekeepers were tricked or blackmailed in any obvious ways, and only then do you tentatively look at what the AI was doing.
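Concretely, the handling rule might look something like this toy sketch (all of the names here, Gatekeeper, compromised, handle_run, are hypothetical placeholders, not a real protocol):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Gatekeeper:
    """One containment layer; 'compromised' stands in for any sign
    of the gatekeeper having been tricked or blackmailed."""
    name: str
    compromised: bool

def handle_run(status: str, layers: list[Gatekeeper],
               output: object) -> Optional[object]:
    """Only the pass/fail bit escapes; everything else is either
    destroyed unexamined or audited layer by layer."""
    if status == "fail":
        # Wipe everything and never look at what happened.
        return None
    # On "pass": stop, then audit each gatekeeper from the outermost
    # layer inward before touching the AI's actual output.
    for layer in layers:
        if layer.compromised:
            return None
    # Only after every gatekeeper checks out do we tentatively
    # look at what the AI was doing.
    return output

# A clean run whose output we only now allow ourselves to read.
layers = [Gatekeeper("outer", False), Gatekeeper("inner", False)]
print(handle_run("pass", layers, output="<AI transcript>"))
```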
Ideally the AI has no idea how many layers of containment it’s in, no idea that the top level gatekeepers are humans, and knows as little as possible about humans and about our universe’s physics. This should restrict its ability to plant tailor-made basilisks in its code or activities.
I actually don’t think these steps would be sufficient to constrain a superhuman AI, I see them more as the types of things you would do when you were already pretty sure the AI was Friendly.
Ok, I take it by “one-way-blind” you mean that each layer gets no new information beyond what is already in its database, except what is explicitly controlled by the humans. (E.g. I guess each layer should know the human query, in order to evaluate whether the AI’s answer is manipulative.)
I also understand that we do look at complex information given by the AI, but only if the security bit signals “ok”.
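In that reading, the information flow could be sketched roughly like so (toy Python; the keyword filter is purely a hypothetical placeholder, since in reality each gatekeeper would be an agent judging the answer against the query, not a string match):

```python
def layer_check(query: str, answer: str, database: set[str]) -> bool:
    """One gatekeeper layer: it sees the human query, the AI's answer,
    and nothing beyond its human-curated database. A real layer would
    judge manipulativeness relative to the query; this placeholder
    only scans the answer for flagged terms."""
    banned = {"blackmail", "release me"}  # placeholder heuristic
    return not any(term in answer.lower() for term in banned | database)

def security_bit(query: str, answer: str,
                 layer_databases: list[set[str]]) -> bool:
    """Every layer must independently signal ok; the humans see
    only this single bit."""
    return all(layer_check(query, answer, db) for db in layer_databases)

# Humans look at `answer` only if the bit comes back True.
bit = security_bit("How do we cure X?", "Apply protocol Y.",
                   [set(), {"basilisk"}])
print(bit)
```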
Ideally the AI […] knows as little as possible about humans and about our universe’s physics.
That seems problematic, as that kind of knowledge will be crucial for the optimization we want the AI to perform.
How would humanity harness the AI’s potential when the only information that escapes the system is a status bit? (Maybe I misunderstood your model.)