Submission. “Hacking/phishing assistant.” For the counterfactual Oracle, ask the Oracle to predict what would happen if one were to send a given message/data/command to some (human or machine) hacking/phishing target. In the event of erasure, actually send that message to the target and use the actual response to train the Oracle. Note that this is safer than using RL to automate hacking/phishing because humans are coming up with the candidate messages to send (so they’ll avoid messages that could cause bad side effects, such as psychological damage to the recipient or the creation of self-replicating code), but potentially more capable than using humans or human imitators to do hacking/phishing because the Oracle can model the target better than humans can. (ETA: This idea could be combined with a human imitator to make the system faster / more capable.)