I liked how in your AISS support talk you used history as a frame for thinking about this, because it highlights the difficulty of achieving superhuman ethics. Human ethics (as encoded, for instance, in laws, rights, and norms) has improved over time, but it has been a very slow process involving a lot of stumbling around and running experiments to figure out what works and what doesn't. “The Moral Arc” by Michael Shermer is about the causes of moral progress; one of them is allowing free speech and the free flow of ideas. Basically, it seems moral progress requires a culture that supports conjecture and criticism of many ideas, because that way you are more likely to find the good ones. How you get an AI to generate new ideas is anyone's guess; “creativity” in AI is pretty shallow right now, and I am not aware of any AI having invented anything useful. (There have been news reports about AI systems that have found new drugs, but the ones I've seen were later called out as just slight modifications of existing drugs that were in their training data, so they were not especially creative.)
To be honest, I only read sections I-III of this post.
I have a comment on this:
An even more speculative thing to try would be auto-supervision. A language model can not only be asked to generate text about ethical dilemmas, it can also be asked to generate text about how good different responses to ethical dilemmas are, and the valence of the response can be used as a reinforcement signal on the object-level decision.
This is a nice idea. It's easy to implement, and my guess is it should improve consistency. I actually saw something similar done in computer vision: someone took the labels generated by a CNN on a previously unlabeled dataset and then used those labels to fine-tune the same CNN. Surprisingly, the result was a slightly better model. I think what that process does is encourage consistency across a larger swath of data. I'm having trouble finding the paper right now, though, and I have no idea if the result replicated. If you would like, I can try to find it; I think it was in the medical imaging domain, where data with ground-truth labels is scarce, so being able to train on autogenerated (“weak”) labels is super useful.
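For concreteness, here is roughly what I mean by that CNN trick, as a minimal PyTorch-style sketch. To be clear, the model, data loader, confidence threshold, and function name are all placeholders I made up; I don't remember exactly how the paper I'm thinking of did it.

```python
import torch
import torch.nn.functional as F

def pseudo_label_finetune(model, unlabeled_loader, optimizer, confidence=0.9, epochs=1):
    """Fine-tune a classifier on its own ("weak") labels for unlabeled images."""
    device = next(model.parameters()).device
    for _ in range(epochs):
        for images in unlabeled_loader:
            images = images.to(device)

            # 1) Use the current model to label the unlabeled batch.
            model.eval()
            with torch.no_grad():
                probs = F.softmax(model(images), dim=1)
                conf, pseudo_labels = probs.max(dim=1)
            keep = conf >= confidence  # only trust confident predictions

            if keep.sum() == 0:
                continue

            # 2) Fine-tune the same model on those autogenerated labels.
            model.train()
            loss = F.cross_entropy(model(images[keep]), pseudo_labels[keep])
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```

The auto-supervision idea in the quoted paragraph seems like the language-model analogue of this: the model's own judgments of how good a response is play the role of the pseudo-labels (or of a reward signal), which is also why I'd guess it buys you mostly consistency rather than genuinely new ethical insight.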