Learning what we can about how ML algorithms generalize seems very important. The classical philosophy of alignment tends to be very pessimistic about anything like this possibly being helpful. (That is, it is claimed that trying to reward “happiness-producing actions” in the training environment is doomed, because the learned goal will definitely generalize to something not-what-you-meant like “tiling the galaxies with smiley faces.”) That is, of course, the conservative assumption. (We would prefer not to bet the entire future history of the world on AI goals happening to generalize “correctly”, if we had the choice not to bet.) But it would be nice to have more data and not just philosophy. (If the conservative assumption is importantly false, that’s great news; if the conservative assumption can be shown to be true, that could help convince labs to slow down.)