Your first point is indeed an issue, and I’m thinking about it. The second is less of a problem: now that we have a goal description, implementing the goal is less of a concern.
Possibly a third, adversarial AI? Have an AI that generates the questions based on P; it is rewarded if the second AI evaluates their probability as close to 50%, if the first AI can get them right based on P’, and if the human gets them wrong.
That’s probably not quite right; we want the AI to generate hard but not impossible questions. Possibly some sort of term about the AIs predicting whether the human will get a question right?
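Very roughly, the generator’s reward might look something like the sketch below. Everything in it is a placeholder of my own (the names `evaluator_prob`, `answerer_correct`, `human_correct`, `predicted_human_accuracy`, and the equal weighting of the terms), not anything pinned down in the proposal; it is just to make the shape of the idea concrete.

```python
def generator_reward(evaluator_prob: float,
                     answerer_correct: bool,
                     human_correct: bool,
                     predicted_human_accuracy: float) -> float:
    """Hypothetical reward for the question-generating (third) AI.

    evaluator_prob: the second AI's estimated probability for the question's answer.
    answerer_correct: whether the first AI, working from P', answered correctly.
    human_correct: whether the human answered correctly.
    predicted_human_accuracy: the AIs' predicted chance the human gets it right.
    """
    # Reward questions the second AI finds maximally uncertain (probability near 50%).
    uncertainty_term = 1.0 - 2.0 * abs(evaluator_prob - 0.5)
    # Reward questions the first AI can still answer correctly from P'.
    answerer_term = 1.0 if answerer_correct else 0.0
    # Reward questions the human actually gets wrong...
    human_term = 0.0 if human_correct else 1.0
    # ...but add a term for the predicted human success chance, so the generator
    # prefers "hard but not impossible" questions over hopeless ones.
    feasibility_term = predicted_human_accuracy
    return uncertainty_term + answerer_term + human_term + feasibility_term
```

Whether these terms should be added, multiplied, or weighted differently is exactly the part I’m unsure about; the feasibility term is the tentative fix for the "hard but not impossible" worry above.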