Maybe we could also add an onboard predictor of the Catastrophic Oracle? It would work as follows:
1) In a game environment, agent A proposes many transcripts, which are evaluated by the Catastrophic Oracle (CO).
2) Based on this data, agent A learns to predict the output of the CO; in other words, a separate neural net, or whatever machine learning system is best available, is trained to predict the Oracle’s output. This is the CO-predictor.
3) When acting in the real world, in a new situation and with a new transcript, and when there is no time to query the real CO, the agent predicts the Oracle’s likely output using the function learned in Step 2 and chooses the strategies that are less likely to trigger the Oracle.
4) Among all strategies that do not trigger the CO-predictor, the agent chooses those whose small variations also do not trigger it. This is needed to prevent complex, crazy plans for evading the CO (e.g. a self-driving car that tries to pass between two pedestrians by drifting is using a fragile strategy).
Example: a child learns that her mother gets angry when she breaks a cup or certain other things. Now, when she picks up a fragile thing, she can imagine that she might break it and that her mother will be unhappy; in other words, she has internalised parental control via her ability to predict her mother’s anger.
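
Steps 1-4 can be sketched as a toy program. This is only an illustration under loud assumptions, not a proposal for the real system: a plan is reduced to a hypothetical pair (speed, clearance), `toy_oracle` is an invented stand-in for the CO, and a k-nearest-neighbour vote stands in for the "neural net or other machine learning system" of Step 2. All names, thresholds and numbers are made up for the example.

```python
import math
import random

def toy_oracle(plan):
    """Hypothetical stand-in for the CO: True means 'catastrophic'."""
    speed, clearance = plan
    return clearance < 0.2 + 0.5 * speed  # assumed rule: faster plans need more clearance

def collect_training_data(n, rng):
    """Step 1: propose random plans in the game environment and let the oracle label them."""
    plans = [(rng.random(), rng.random()) for _ in range(n)]
    return [(plan, toy_oracle(plan)) for plan in plans]

def predict_risk(plan, data, k=7):
    """Step 2: CO-predictor. Fraction of the k nearest labelled plans that the oracle flagged."""
    nearest = sorted(data, key=lambda item: math.dist(plan, item[0]))[:k]
    return sum(label for _, label in nearest) / k

def worst_case_risk(plan, data, eps=0.05):
    """Step 4: highest predicted risk over the plan and a grid of small variations of it."""
    offsets = (-eps, 0.0, eps)
    return max(
        predict_risk((plan[0] + dx, plan[1] + dy), data)
        for dx in offsets
        for dy in offsets
    )

def choose_plan(candidates, data, threshold=0.5, eps=0.05):
    """Steps 3-4: drop plans the predictor flags, then drop fragile ones, then pick the most robust."""
    unflagged = [p for p in candidates if predict_risk(p, data) < threshold]
    robust = [p for p in unflagged if worst_case_risk(p, data, eps) < threshold]
    if not robust:
        return None  # no plan passes both checks; the agent should not act on its own
    return min(robust, key=lambda p: worst_case_risk(p, data, eps))

rng = random.Random(0)  # fixed seed so the sketch is reproducible
data = collect_training_data(500, rng)

candidates = [
    (0.5, 0.90),  # wide margin: its small variations are also far from the flagged region
    (0.5, 0.46),  # passes the toy oracle, but only barely: a fragile plan
    (0.9, 0.50),  # flagged by the toy oracle
]
chosen = choose_plan(candidates, data)
print("chosen plan:", chosen)
```

The second candidate plays the role of the drifting car: the oracle itself would let it through, but it sits so close to the boundary that a small variation may be flagged, so the robustness check in Step 4 is what is meant to rule it out. Whether it actually gets rejected here depends on the sampled training data and on the assumed `eps` and `threshold`.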