Submission. For the counterfactual Oracle, ask the Oracle to predict the n best posts on AF during some future time period (counterfactually, if we didn’t see the Oracle’s answer). The reward function is then computed as the similarity between the predicted posts and the actual top posts on AF as ranked by karma, with similarity measured by some ML model.
This seems to potentially significantly accelerate AI safety research while being safe, since it’s just showing us posts similar to what we would have written ourselves. If the ML model for measuring similarity isn’t secure, the Oracle might produce output that attacks the ML model, in which case we might need to fall back to some simpler way of measuring similarity.
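To make the setup a bit more concrete, here is a minimal sketch of one training episode under this proposal. Everything here is a placeholder I made up for illustration: the `oracle` object, the `embed` text-embedding function, and the `fetch_top_af_posts` helper are hypothetical interfaces, and the erasure probability and similarity measure are just example choices, not part of the original submission.

```python
import random
import numpy as np

ERASURE_PROB = 0.01  # fraction of episodes whose answers are never shown to humans
N_POSTS = 10         # how many top AF posts the Oracle is asked to predict

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def run_episode(oracle, embed, fetch_top_af_posts, period):
    """One counterfactual-Oracle episode (all interfaces hypothetical)."""
    predicted = oracle.predict_posts(forum="AF", period=period, n=N_POSTS)

    if random.random() < ERASURE_PROB:
        # Erasure event: nobody reads the answer, so the forum evolves
        # as if the Oracle had never been consulted.
        actual = fetch_top_af_posts(period=period, n=N_POSTS)  # after the period ends
        # Reward: average similarity between each predicted post and
        # its closest actual top-karma post, via the embedding model.
        reward = np.mean([
            max(cosine(embed(p), embed(a)) for a in actual)
            for p in predicted
        ])
        oracle.update(reward)   # only erased episodes produce a training signal
        return None             # humans never see this answer
    else:
        return predicted        # humans read the answer, but it is never rewarded
```

The intended property is that the reward only ever depends on episodes where the answer was erased, so (under the usual counterfactual-Oracle assumptions) the Oracle gets no training incentive to shape the forum with the answers we actually read.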
It looks like my entry is pretty close to the ideas of Human-in-the-counterfactual-loop and imitation learning and apprenticeship learning. Questions:
Stuart, does it count against my entry that it’s not actually a very novel idea? (If so, I might want to think about other ideas to submit.)
What is the exact relationship between all these ideas? What are the pros and cons of doing human imitation using this kind of counterfactual/online-learning setup, versus other training methods such as GANs (see Safe training procedures for human-imitators for one proposal)? It seems like there are lots of posts and comments about human imitation spread over LW, Arbital, Paul’s blog and maybe other places, and it would be really cool if someone (with more knowledge of this area than I have) could write a review/distillation post summarizing what we know about it so far.
I encourage you to submit other ideas anyway, since your ideas are good.
Not sure yet how all these things relate; I’ll maybe think about that more later.
What if another AI would have counterfactually written some of those posts to manipulate us?
If that seems a realistic concern during the time period that the Oracle is being asked to predict, you could replace the AF with a more secure forum, such as a private forum internal to some AI safety research team.
This seems incredibly dangerous if the Oracle has any ulterior motives whatsoever. Even – nay, especially – the ulterior motive of future Oracles being better able to affect reality to better resemble their provided answers.
So, how can we prevent this? Is it possible to produce an AI with its utility function as its sole goal, to the detriment of other things that might… increase utility, but indirectly? (Is there a way to add a “status quo” bonus that won’t hideously backfire, or something?)
(I’m still confused and thinking about this, but figure I might as well write this down before someone else does. :)
While thinking more about my submission and counterfactual Oracles in general, this class of ideas for using counterfactual Oracles is starting to look like an attempt to implement supervised learning on top of RL capabilities, because SL seems safer (less prone to manipulation) than RL. Would it ever make sense to do this in reality (instead of just doing SL directly)?