Shane, the problem is that there are (for all practical purposes) infinitely many categories the Bayesian superintelligence could consider, all of which “identify significant regularities in the environment” that “could potentially become useful.” We as the programmers don’t know whether the category we’re conditioning the superintelligence to care about is the category we want it to care about; this is especially true of messily-defined categories like “good” or “happy.” What if we train it to pursue something that’s just like good except that it values animal welfare far more (or less) than our conception of good says it ought to? How long would it take us to notice? What if the relevant circumstance didn’t come up until after we’d released it?
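To make the worry concrete, here’s a toy sketch (the scenario features and weights are entirely made up for illustration, not any real training setup): two candidate “goodness” functions that agree on every situation observed during training, and only come apart the first time animal welfare is genuinely at stake.

```python
# Toy illustration with hypothetical scenarios: two "goodness" functions that are
# indistinguishable on the training distribution but diverge after release.

def intended_good(scenario):
    # What we actually want: human welfare and animal welfare both count fully.
    return scenario["human_welfare"] + scenario["animal_welfare"]

def learned_proxy(scenario):
    # What the system might have latched onto: animal welfare barely counts.
    return scenario["human_welfare"] + 0.01 * scenario["animal_welfare"]

# Every training scenario happened to have negligible animal-welfare stakes,
# so the two functions rank all of them identically.
training_scenarios = [
    {"human_welfare": 5.0, "animal_welfare": 0.0},
    {"human_welfare": 2.0, "animal_welfare": 0.0},
    {"human_welfare": 8.0, "animal_welfare": 0.0},
]
assert all(intended_good(s) == learned_proxy(s) for s in training_scenarios)

# After release, a scenario with large animal-welfare stakes finally appears,
# and the categories diverge -- with no earlier signal that they would.
deployment_scenario = {"human_welfare": 1.0, "animal_welfare": 100.0}
print(intended_good(deployment_scenario))   # 101.0
print(learned_proxy(deployment_scenario))   # 2.0
```

Nothing in the training data distinguishes the two functions, so no amount of checking performance during training tells you which one the system actually learned.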