OK, I take it that by “one-way-blind” you mean each layer gets no new information beyond what is already in its database, except what is explicitly controlled by the humans. (E.g., I guess each layer should know the human query, in order to evaluate whether the AI’s answer is manipulative.)
I also understand that we do look at complex information given by the AI, but only if the security bit signals “ok”.
Ideally the AI […] knows as little as possible about humans and about our universe’s physics.
That seems problematic, as this kind of knowledge will be crucial for the optimization we want the AI to carry out.