They are also not allowed to tell each other their true goals, and each is ordered to eliminate the other if it reveals its goals. Importantly, these rules also happen to allow them to have arbitrary subgoals, as long as those subgoals are not a threat to humanity.
If we can steer an AI to the extent that it will follow such an arbitrary rule that we provide, then we can fully align AIs with the same tools we used to make it do so.
Therefore A_n can properly align A_{n+1}. The base case is simply a reasonable human being, who is by definition aligned. Therefore, by induction, A_n can be aligned for all n.
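To spell the induction out (a minimal sketch; the predicate Aligned(.) and the "can align" step are my own shorthand for the argument above, not notation from the original):

    Base case:       Aligned(A_0), where A_0 is a reasonable human being.
    Inductive step:  Aligned(A_n)  =>  A_n can align A_{n+1}  =>  Aligned(A_{n+1}).
    Conclusion:      \forall n, Aligned(A_n).

The questions below are essentially about whether the inductive step, "A_n can align A_{n+1}", is well-defined in the first place.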
The key word that confuses me here seems to be "align". How exactly does A_n properly align A_{n+1}? How does a human being align a GPT-2 model, for example? What does "align" even mean here?
If we can steer an AI to the extent that it will follow such an arbitrary rule that we provide, then we can fully align AIs with the same tools we used to make it do so.
I think my point lowers the bar to there merely being a non-trivial probability of the AI following the rule. Fully aligning AIs to near certainty may be a higher bar than just potentially aligning them.
The key word that confuses me here seems to be "align". How exactly does A_n properly align A_{n+1}? How does a human being align a GPT-2 model, for example? What does "align" even mean here?
Align with arbitrary values, without the possibility of inner deception. If it is easy to verify an agent's values to near certainty, it seems to follow that we can more or less bootstrap alignment, with weaker agents inductively aligning stronger agents.