Ah, I should have written that question differently. I meant to ask “If we cannot robustly grade expected-diamond-production for every plan the agent might consider, how might we nonetheless design a smart agent which makes lots of diamonds?”
How do you do this?
Anyways, we might train a diamond-values agent like this.