They probably do not know where the real difficulties are, they probably do not understand what needs to be done, they cannot tell the difference between good and bad work, and the funders also can’t tell without me standing over their shoulders evaluating everything, which I do not have the physical stamina to do.
This was the sentiment I got after applying to the LTFF with an idea. Admittedly, I couldn’t really say whether my idea had been tried before, or whether it was obviously bad, but the conversation basically boiled down to whether I wanted to use this project as a way to grow in the field, rather than to any particular merits or faults of the idea itself. My motivation was really about trying a cool idea that I genuinely believed could practically improve AI safety if it succeeded, while ethically I couldn’t commit to staying in the field even if it (likely?) failed, since I like to go wherever my ideas take me.
Since it may be a while before I personally try out the idea, the most productive thing I can do seems to be to share it. It’s essentially an attempt at a learning algorithm which ‘forces’ a model’s weights to explain the reasoning/motivations behind its actions. The training process looks a bit like a GAN, with the original model’s inner-layer outputs serving as a feature vector. A GPT-3-esque pretrained model learns to convert this feature vector into tokens (at first random gibberish), which are used to train another GPT-3-esque model to perform the actions of the original model (i.e. it is given the same inputs as the original, plus an explanation from the first model of what it should do). The basic idea is that explanations which correspond more closely to the right reasoning will cause the ‘learner’ model to improve faster, and this acts as feedback telling the ‘teacher’ model that its explanations are getting better. Ideally, the end result of this procedure is a way to get, as readable text, the exact reasoning behind any action.
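To make the setup concrete, here is a minimal sketch of how I picture the loop, with tiny toy networks standing in for the three GPT-3-scale models. Every module, dimension, and training detail here is an illustrative assumption rather than a worked-out implementation, and in particular using REINFORCE over the discrete explanation tokens is just my guess at how to close the non-differentiable feedback loop.

```python
# Toy sketch of the explain-to-imitate loop; all names and sizes are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

D_IN, D_HID, N_ACT, VOCAB, EXPL_LEN = 32, 64, 10, 100, 8

class Original(nn.Module):
    """The frozen black box; we only sample its inner-layer outputs."""
    def __init__(self):
        super().__init__()
        self.inner = nn.Linear(D_IN, D_HID)
        self.head = nn.Linear(D_HID, N_ACT)
    def forward(self, x):
        h = torch.relu(self.inner(x))   # inner-layer feature vector
        return self.head(h), h

class Explainer(nn.Module):
    """'Teacher': maps the feature vector to a short discrete token sequence."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(D_HID, EXPL_LEN * VOCAB)
    def forward(self, h):
        dist = torch.distributions.Categorical(
            logits=self.proj(h).view(-1, EXPL_LEN, VOCAB))
        tokens = dist.sample()          # starts out as random gibberish
        return tokens, dist.log_prob(tokens).sum(-1)

class Learner(nn.Module):
    """Mimics the original, given the same input plus the explanation."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, 16)
        self.net = nn.Linear(D_IN + EXPL_LEN * 16, N_ACT)
    def forward(self, x, expl):
        return self.net(torch.cat([x, self.embed(expl).flatten(1)], dim=-1))

original, explainer, learner = Original(), Explainer(), Learner()
original.requires_grad_(False)
opt_l = torch.optim.Adam(learner.parameters(), lr=1e-3)
opt_e = torch.optim.Adam(explainer.parameters(), lr=1e-3)

def heldout_agreement():
    """Learner vs. original on fresh inputs with NO explanation (token 0 as a
    crude blank) -- the signal the explainer is ultimately judged on."""
    with torch.no_grad():
        x = torch.randn(256, D_IN)
        target, _ = original(x)
        pred = learner(x, torch.zeros(256, EXPL_LEN, dtype=torch.long))
        return (pred.argmax(-1) == target.argmax(-1)).float().mean().item()

baseline = heldout_agreement()
for step in range(1000):
    x = torch.randn(64, D_IN)
    with torch.no_grad():                 # original stays a static black box
        target_logits, h = original(x)
    tokens, logp = explainer(h)

    # Learner: imitate the original's outputs, conditioned on the explanation.
    loss_l = F.kl_div(F.log_softmax(learner(x, tokens), -1),
                      F.softmax(target_logits, -1), reduction="batchmean")
    opt_l.zero_grad(); loss_l.backward(); opt_l.step()

    # Explainer: rewarded by how much the learner improved on held-out inputs
    # where no explanation is given (REINFORCE over the sampled tokens).
    if step % 50 == 49:
        score = heldout_agreement()
        loss_e = -((score - baseline) * logp).mean()
        baseline = score
        opt_e.zero_grad(); loss_e.backward(); opt_e.step()
```

The design choice I care about most is that the original model is only ever run without gradients and never fine-tuned: it stays a frozen artifact whose features we sample.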
In my mind this kind of procedure avoids many of the problems in this list. You end up not relying on trusting the output of the model; instead you treat it as a static black box whose inner-layer outputs you sample, until this procedure produces an effective copy of it along with a mechanism for obtaining its inner reasoning. If it considers something nefarious to be the optimal way of doing something, but wants to hide that from its outputs, the nefarious reasoning will still have to be explained to the ‘learner’ model for it to better mimic the original. If any of this just utterly misses one of the points above, I’d like to hear it, and if anyone’s interested in hearing more about this idea, I’ll make a post with the rest of my original submission and other thoughts that couldn’t fit into the word limit for LTFF submissions.
These are great points, and ones which I did actually think about when I was brainstorming this idea (if I understand them correctly). I intend to write out a more thorough post on this tomorrow with clear examples (I originally imagined this as extracting deeper insights into chess), but to answer these:
I did think of these as translators of a model’s actions into natural language, though I don’t get the point about extracting things beyond what’s in the original model.
I mostly glossed over this part in the brief summary; the motivation I had for it comes from how (unexpectedly?) well it works for GANs to just start from random noise, with the generator and discriminator still improving each other in the process.
My thought here was for the explainer model’s update signal to come from judging the learner model on new, unseen tasks without the explanation (i.e. how similar its outputs are to the original model’s). In this way the explainer gets little benefit from just giving the answer directly, since the learner will be tested without it, but if the explanation in any way helps the learner learn, its performance will improve more (this is basically what the entire idea hinges on). A compact sketch of that signal is below.
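Concretely, the feedback signal I have in mind looks something like this (the `original`/`learner` handles and the blank-explanation placeholder are the same hypothetical stand-ins as in the toy sketch in my earlier comment):

```python
import torch

def explainer_reward(original, learner, heldout_x, blank_expl, prev_agreement):
    """Judge the learner on unseen inputs WITHOUT any explanation; the change
    in agreement with the original model's outputs is the explainer's reward."""
    with torch.no_grad():
        target = original(heldout_x)[0].argmax(-1)        # original's actions
        pred = learner(heldout_x, blank_expl).argmax(-1)  # learner, no explanation
        agreement = (pred == target).float().mean().item()
    return agreement - prev_agreement, agreement
```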