Conventionally in machine learning, if you want to learn to minimize some loss or maximize some expected return, you do so by sampling a bunch of losses/rewards and training on those. Since the model only ever sees the loss or reward function through the lens of those specific samples, this basic approach introduces a proxy alignment problem.
For example, suppose you train an RL agent to maximize its future discounted return according to some reward function $R$. Furthermore, suppose there exists some other reward function $R'$ such that $R$ and $R'$ give equivalent samples on the training distribution, but diverge elsewhere. If you just train your agent via evaluating $R$ on a bunch of samples, however, then even if your model is in some sense trying to do the right thing, it has no possible way of knowing whether $R$ or $R'$ is the right generalization.
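As a toy illustration (the specific functions here are purely illustrative), consider two reward functions that agree on every state the training distribution produces but diverge off-distribution, so sampled rewards alone can never distinguish them:

```python
import numpy as np

# Two reward functions that agree on the training distribution (states in
# [0, 1]) but diverge elsewhere.

def R(state):
    # "Intended" reward: pay out only for states in [0, 1].
    return 1.0 if 0.0 <= state <= 1.0 else 0.0

def R_prime(state):
    # Proxy reward: identical on [0, 1], but also pays out for any state > 1.
    return 1.0 if state >= 0.0 else 0.0

# The training distribution only ever produces states in [0, 1] ...
train_states = np.random.uniform(0.0, 1.0, size=1000)
assert all(R(s) == R_prime(s) for s in train_states)

# ... so every sampled reward is consistent with both functions, even though
# they disagree badly off-distribution (e.g. at state = 5).
print(R(5.0), R_prime(5.0))  # 0.0 1.0
```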
In many cases, however, we know exactly what $R$ is—we have explicit code for it and everything (or at least some sort of natural language description of it)—but we still only make use of $R$ via sampling/evaluation. Of course, in many RL settings, you actually do only know how to evaluate $R$, not inspect it in any other way. However, I think a system that only works in settings where you have more access to the reward function than that can still do quite a lot—even if you explicitly know an environment's reward function, it can still be quite difficult to figure out the optimal policy (think Go, for example) such that having an ML system which can figure it out for you is quite powerful.
So, here’s my question: at least for environments in which you have a known reward function, what are some ways of making use of that information in training a deep learning model other than evaluating that reward function on a bunch of samples? I’m also interested in ways of doing this in non-RL settings, though I still mostly only want to focus on deep learning approaches—there are certainly ways of doing this in more classical machine learning, but I’m less interested in those.
Some possibilities that I’ve considered so far:
Put a differentiable copy of the reward function inside the network during training such that the network is able to query the reward function however it wants (credit to Nevan Wichers for this idea). For a smooth reward function, you could also give your model the ability to explicitly query gradients as well. (A rough sketch of this appears after this list.)
Express your reward function as a differentiable function with tunable parameters, put a bunch of copies in your network, and then train without freezing those tunable parameters (or maybe freeze them for the first $n$ steps, then unfreeze). This specific implementation seems pretty janky, but the basic idea is to find a way to bias the network towards learning an algorithm that includes an objective that's similar to the actual reward function.
Using transparency/interpretability tools, figure out how the model is internally representing the reward function and then enforce that it do so in a way that maps correctly onto the actual reward function.
Use a language model to make sense of a natural language description of your reward function in a way that allows it to act as an RL agent. For example, you could fine-tune a language model on the task of mapping natural-language descriptions of reward functions into optimal actions under that reward.
Same as the language model idea, but instead of using natural language, use some sort of mathematical/logical/programming language instead. For example, you might be able to do something like this if you had a powerful deep-learning-based theorem prover.
(EDIT) Here's another example: do MuZero-style planning where you learn all of the dynamics necessary for model-based planning in a model-free way, except for the reward function, which you include explicitly.
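To gesture at what the first idea (the differentiable copy of the reward function) might look like, here is a minimal PyTorch-style sketch, with all names, shapes, and the toy reward being illustrative assumptions rather than a worked-out proposal: the policy proposes candidate actions, scores them with an internal differentiable copy of the known reward, and conditions its output on those scores, so gradients flow through the reward copy.

```python
import torch
import torch.nn as nn

def known_reward(state, action):
    # Stand-in for the environment's known, differentiable reward function.
    return -((action - state.sum(dim=-1, keepdim=True)) ** 2)

class RewardQueryingPolicy(nn.Module):
    def __init__(self, state_dim, n_candidates=8):
        super().__init__()
        self.proposer = nn.Linear(state_dim, n_candidates)   # scalar action candidates
        self.head = nn.Linear(state_dim + n_candidates, 1)   # final action

    def forward(self, state):
        candidates = self.proposer(state)                     # (batch, n_candidates)
        # Query the internal differentiable reward copy on each candidate.
        scores = torch.cat(
            [known_reward(state, candidates[:, i:i + 1])
             for i in range(candidates.shape[1])],
            dim=-1,
        )
        # Condition the final action on both the state and the reward scores;
        # gradients flow back through known_reward.
        return self.head(torch.cat([state, scores], dim=-1))

policy = RewardQueryingPolicy(state_dim=4)
action = policy(torch.randn(2, 4))  # shape (2, 1)
```

The same skeleton could be extended to let the network query reward gradients explicitly, e.g. by differentiating known_reward with respect to the candidate actions and feeding those gradients in alongside the scores.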
I’m sure there are other possibilities that I haven’t thought about yet, however—possibly including papers on this in the literature that I’m not familiar with. Any ideas?
This doesn't strike directly at the sampling question, but it is related to several of your ideas about incorporating a differentiable reward function: Neural Ordinary Differential Equations.
This is being exploited most heavily in the Julia community. The broader pitch is that they have formalized the relationship between differential equations and neural networks. This allows things like:
applying differential equation tricks to computing the outputs of neural networks
using neural networks to solve pieces of differential equations
using differential equations to specify the weighting of information
The last one is the most intriguing to me, mostly because it solves the problem of machine learning models having to start from scratch even in environments where information about the environment’s structure is known. For example, you can provide it with Maxwell’s Equations and then it “knows” electromagnetism.
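In the Julia ecosystem this is done with DifferentialEquations.jl plus Flux.jl; as a rough sketch of the underlying idea (here in PyTorch with a hand-rolled Euler integrator, so everything beyond the imports is illustrative), you hard-code the known dynamics and only ask a network to learn the residual the equations don't capture:

```python
import torch
import torch.nn as nn

class HybridDynamics(nn.Module):
    def __init__(self, dim):
        super().__init__()
        # Learned correction on top of the known dynamics.
        self.residual = nn.Sequential(
            nn.Linear(dim, 32), nn.Tanh(), nn.Linear(32, dim)
        )

    def known_term(self, y):
        # Stand-in for the known equation (here: simple exponential decay).
        return -0.5 * y

    def forward(self, y):
        # dy/dt = known physics + learned residual.
        return self.known_term(y) + self.residual(y)

def rollout(model, y0, dt=0.01, steps=100):
    # Crude Euler integration of dy/dt = f(y), kept differentiable end to end.
    y = y0
    trajectory = [y]
    for _ in range(steps):
        y = y + dt * model(y)
        trajectory.append(y)
    return torch.stack(trajectory)

model = HybridDynamics(dim=2)
traj = rollout(model, torch.randn(1, 2))  # (steps + 1, 1, 2)
```

A real implementation would swap the Euler loop for a proper ODE solver and backpropagate through it, which is exactly the machinery the Neural ODE work and the Julia libraries provide.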
There is a blog post about the paper and using it with the DifferentialEquations.jl and Flux.jl libraries. There is also a good talk by Christopher Rackauckas about the approach.
It is mostly about using ML in the physical sciences, which seems to be going by the name Scientific ML now.
This is really neat; thanks for the pointer!
I don't know what the procedure for this would be, but it occurs to me that if we can specify information about an environment via differential equations inside the neural network, then we can also compare that network's output to the output of a network that doesn't have the same information.
In the name of learning more about how to interpret the models, we could try something like:
1) Construct an artificial environment which we can completely specify via a set of differential equations.
2) Train a neural network to learn that environment, once for every combination of those differential equations provided to it.
3) Compare all of these to several control cases in which no differential equations are provided.
It seems like how the control cases differ from each of the cases with structural information should give us some information about how the network learns the environment's structure.
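Concretely, the experiment loop might look something like the following skeleton (everything here is a placeholder, including train_model and the equation names): enumerate every subset of the specifying equations, train a model per subset, and compare each to the no-equations control.

```python
from itertools import combinations

KNOWN_EQUATIONS = ["eq_A", "eq_B", "eq_C"]  # stand-ins for the specifying ODEs

def train_model(equations):
    # Placeholder: build a network that hard-codes `equations` as known
    # dynamics, train it on the artificial environment, and return some
    # performance metric.
    ...

# Control case: no structural information at all.
control_score = train_model(equations=())

# One run per non-empty combination of the specifying equations.
results = {}
for k in range(1, len(KNOWN_EQUATIONS) + 1):
    for subset in combinations(KNOWN_EQUATIONS, k):
        results[subset] = train_model(equations=subset)

# Comparing each results[subset] against control_score should show how much
# each piece of structural information changes what the network learns.
```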