Submission: low-bandwidth and maybe counterfactual Oracle.
General idea: Use a Meta-Oracle to check the safety of Oracles. In a nutshell, the Meta-Oracle (MO) just outputs 1 or 0, safe or not safe, after receiving the code of another Oracle. If the Meta-Oracle is trustworthy, then the Oracle it deems safe is also trustworthy. However, we need a procedure to check the correctness of the MO.
Path 1
1. We create a Generator Oracle (GO), which takes 1 or 0 as input: for 0 it outputs the code of a random malevolent Oracle, and for 1 the code of a safe Oracle. The memory of the GO is erased after each turn.
2. We create a large database of presumably good or bad oracles (without running any). Humans do not see the code of any of them, so the effective output reaching humans is zero.
3. We use different types of GOs to fill our database of good and bad oracles (to prevent coordination among them).
4. We test the Meta-Oracle on this database: it should correctly identify which oracles are good and which are bad.
5. If the list of MO guesses doesn’t match our commands to create safe or unsafe oracles, then at least one of the GOs or the MO is malevolent or erroneous, so all of them should be ditched.
6. Even if the MO is not trustworthy, it can’t lie to us too often, or it will be caught and ditched. So, basically, we are building something like a GAN, but made of oracles.
7. Using a very large database, we could increase our credence in the truth of the MO’s next output to an arbitrarily high number, like 99.999%.
8. Eventually we have the code of an Oracle that is very likely to be safe.
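A minimal sketch of steps 5 and 7 in Python may make the check concrete. The function names (`audit_meta_oracle`, `max_error_rate`) and the 0/1 list encoding are illustrative assumptions, not part of the proposal. The bound in `max_error_rate` also assumes the MO errs at a fixed, independent per-query rate, which a strategic MO need not do (it could answer honestly on every test and lie only once, when it matters), so the number should be read as an optimistic estimate.

```python
from typing import Sequence


def audit_meta_oracle(commands: Sequence[int], verdicts: Sequence[int]) -> bool:
    """Step 5: True only if every MO verdict matches the command given to a GO.

    commands[i] is the 1/0 we sent to a Generator Oracle for oracle i,
    verdicts[i] is the Meta-Oracle's 1/0 answer for that same oracle.
    A False result means the whole batch of GOs and the MO is ditched.
    """
    if len(commands) != len(verdicts):
        raise ValueError("commands and verdicts must have the same length")
    return all(c == v for c, v in zip(commands, verdicts))


def max_error_rate(n_passed: int, alpha: float = 0.05) -> float:
    """Step 7: upper bound on a fixed per-query error rate after n clean tests.

    ASSUMPTION: errors are independent with a constant rate p. Then the
    chance of n clean tests is (1 - p) ** n; setting this equal to alpha
    and solving for p gives p = 1 - alpha ** (1 / n).
    """
    if n_passed <= 0:
        raise ValueError("n_passed must be positive")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be strictly between 0 and 1")
    return 1.0 - alpha ** (1.0 / n_passed)
```

Under these assumptions, `max_error_rate(1000)` is roughly 0.003, and the bound shrinks approximately in proportion to 1/n as the database grows.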
Path 11. Another way to test MO:
1. We create many different MOs and test them against the database of oracle codes from above. They don’t have memory and can’t communicate.
2. If a small minority of MOs disagrees with the majority about some oracle O1, we ditch this minority. Truth is a Schelling point for MOs, so the easiest way for them to coordinate is to tell the truth. (They could coordinate around the code of O1 if it contains a line like “let’s lie about me” and if they are sure that the majority of the other MOs are also cheaters; however, not all oracle codes will contain such coordination text, and an oracle code could be created deliberately with such a line to catch MOs which are eager to cheat.)
3. Running step 2 many times (millions), we will end up with a subset of MOs which are unlikely to cheat, and thus we will have many Oracle codes which are likely safe.
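The minority-ditching loop in steps 2 and 3 could be sketched as follows. The name `prune_minority`, the dictionary layout, and the rule of ditching nobody on an exact tie are assumptions made for illustration. Note that the sketch only removes dissenters; it does nothing to verify that the majority itself is honest, which is the weakness already discussed in step 2.

```python
from collections import Counter
from typing import Dict, List, Set


def prune_minority(verdicts: Dict[str, List[int]]) -> Set[str]:
    """Return the names of MOs that never ended up in a minority.

    verdicts maps a (hypothetical) MO name to its list of 0/1 verdicts,
    one per oracle in the shared database, all in the same order.
    """
    lengths = {len(v) for v in verdicts.values()}
    if len(lengths) > 1:
        raise ValueError("every MO must judge the same number of oracles")
    n_oracles = lengths.pop() if lengths else 0

    survivors = set(verdicts)
    for i in range(n_oracles):
        counts = Counter(verdicts[name][i] for name in survivors)
        if len(counts) < 2:
            continue  # unanimous verdict, nobody to ditch
        (top, top_n), (_, second_n) = counts.most_common(2)
        if top_n == second_n:
            continue  # exact tie: no majority, so (by assumption) keep everyone
        # Keep only the MOs that agreed with the majority on oracle i.
        survivors = {name for name in survivors if verdicts[name][i] == top}
    return survivors
```

Votes are counted only among the MOs that have survived so far, so an MO ditched on an early oracle no longer influences later majorities.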
Path 111. Another way to test the Meta-Oracle:
1. Create many MOs with different principles and initial conditions.
2. Test the code of each MO against each MO (in other words, each Meta-Oracle tests itself or its siblings).
3. If all tests come back positive, then either all of them are actually good, or all of them are malevolent and were able to coordinate acausally.
4. To increase our credence, ask a Generator Oracle to create malevolent Meta-Oracles, and add them to the mix of mutually testing MOs, so they can’t just blindly cooperate and produce all-positive results.
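As a rough illustration of steps 2 to 4, the sketch below treats each MO as a function from code text to a 0/1 verdict. The `Judge` interface and the names `all_mutually_positive` and `blind_approvers` are hypothetical and exist only for this example.

```python
from typing import Callable, Dict, Iterable, Set

# Hypothetical interface: an MO takes code text and returns 1 (safe) or 0 (unsafe).
Judge = Callable[[str], int]


def all_mutually_positive(
    judges: Dict[str, Judge], codes: Dict[str, str], candidates: Iterable[str]
) -> bool:
    """Steps 2-3: True if every candidate MO approves every candidate's code."""
    names = list(candidates)
    return all(judges[j](codes[s]) == 1 for j in names for s in names)


def blind_approvers(
    judges: Dict[str, Judge], codes: Dict[str, str], planted_bad: Set[str]
) -> Set[str]:
    """Step 4: names of MOs that approved at least one planted malevolent MO."""
    return {
        name
        for name, judge in judges.items()
        if any(judge(codes[bad]) == 1 for bad in planted_bad)
    }
```

An MO that simply returns 1 for everything passes the mutual test in step 3 but shows up in `blind_approvers` once planted malevolent MOs are mixed in, which is the point of step 4.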
These ideas are still vulnerable and may not be mathematically perfect, but they may help to make safer Oracles.