If we build a prediction model for reward functions, maybe a transformer, and run it across a range of environments where we already have the credit assignment solved, we could use that model to estimate candidate goals in other environments.
That could help us discover alternative/candidate reward functions for worlds/envs where we are not sure what to train on with RL, and
it could expose some of the latent thinking processes of AIs, perhaps clarifying instrumental goals with more nuance.
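To make that concrete, here is a minimal sketch of one way it could look, assuming a small PyTorch transformer over (state, action) trajectories; all names, dimensions, and the training setup here (`RewardPredictor`, `train_on_solved_envs`, `STATE_DIM`, etc.) are hypothetical, just to illustrate the shape of the idea:

```python
# Hypothetical sketch only: a tiny transformer that maps (state, action)
# trajectories to per-step reward estimates. Names, dimensions, and the
# training setup are assumptions for illustration, not an existing API.
import torch
import torch.nn as nn

STATE_DIM, ACTION_DIM, D_MODEL = 8, 4, 64

class RewardPredictor(nn.Module):
    """Predicts a scalar reward for every step of a trajectory."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Linear(STATE_DIM + ACTION_DIM, D_MODEL)
        layer = nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(D_MODEL, 1)

    def forward(self, states, actions):
        # states: (batch, T, STATE_DIM); actions: (batch, T, ACTION_DIM) one-hot
        x = self.embed(torch.cat([states, actions], dim=-1))
        h = self.encoder(x)                 # attend across the whole trajectory
        return self.head(h).squeeze(-1)     # (batch, T) per-step reward estimates

def train_on_solved_envs(model, batches, epochs=3):
    """Supervised training on environments where true per-step rewards are known."""
    opt = torch.optim.Adam(model.parameters(), lr=1e-3)
    for _ in range(epochs):
        for states, actions, true_rewards in batches:
            loss = nn.functional.mse_loss(model(states, actions), true_rewards)
            opt.zero_grad()
            loss.backward()
            opt.step()

# Fake data standing in for environments where credit assignment is solved.
states = torch.randn(2, 16, STATE_DIM)
actions = nn.functional.one_hot(torch.randint(ACTION_DIM, (2, 16)), ACTION_DIM).float()
true_rewards = torch.randn(2, 16)

model = RewardPredictor()
train_on_solved_envs(model, [(states, actions, true_rewards)])

# Query the trained model on trajectories from a *new* environment:
new_states = torch.randn(1, 16, STATE_DIM)
new_actions = nn.functional.one_hot(torch.randint(ACTION_DIM, (1, 16)), ACTION_DIM).float()
candidate_rewards = model(new_states, new_actions)   # (1, 16) candidate reward signal
```

The key point is just that supervision comes from environments where the reward/credit assignment is already known, and the trained model is then queried on trajectories from a new environment to propose candidate reward estimates.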
This (not so old) concept seems relevant:
> IRL is about learning from humans. Inverse reinforcement learning (IRL) is the field of learning an agent's objectives, values, or rewards by observing its behavior.

Source: https://towardsdatascience.com/inverse-reinforcement-learning-6453b7cdc90d
I gotta read that later.
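For reference, here is a toy sketch of the IRL idea itself (recovering a reward from observed behaviour), using a simplified maximum-entropy-IRL-style loop on a tiny chain MDP; the chain environment, soft value iteration, and visitation-matching update are my own illustrative simplifications, not code from the article:

```python
# Toy sketch of inverse RL: infer a per-state reward from observed behaviour
# in a 5-state chain MDP, by matching expected state visitation frequencies
# (in the spirit of maximum-entropy IRL). Illustrative assumption-only code.
import numpy as np

N_STATES, N_ACTIONS, HORIZON = 5, 2, 10          # chain of 5 states; actions: left/right

# Deterministic transitions: P[s, a] = next state
P = np.array([[max(s - 1, 0), min(s + 1, N_STATES - 1)] for s in range(N_STATES)])

def soft_value_iteration(reward):
    """Soft-optimal policy for the current reward guess."""
    V = np.zeros(N_STATES)
    for _ in range(50):
        Q = reward[:, None] + V[P]               # Q[s, a] = r(s) + V(next state)
        V = np.logaddexp(Q[:, 0], Q[:, 1])       # soft max over the two actions
    return np.exp(Q - V[:, None])                # stochastic policy pi[s, a]

def expected_svf(policy):
    """Expected state visitation frequencies when following the policy from state 0."""
    d = np.zeros(N_STATES); d[0] = 1.0
    svf = d.copy()
    for _ in range(HORIZON - 1):
        d_next = np.zeros(N_STATES)
        for s in range(N_STATES):
            for a in range(N_ACTIONS):
                d_next[P[s, a]] += d[s] * policy[s, a]
        d = d_next
        svf += d
    return svf

# "Expert" demonstrations: the observed agent always moves right.
expert_svf = expected_svf(np.tile([0.0, 1.0], (N_STATES, 1)))

# Gradient ascent on the reward: match the expert's visitation frequencies.
reward = np.zeros(N_STATES)
for _ in range(200):
    grad = expert_svf - expected_svf(soft_value_iteration(reward))
    reward += 0.05 * grad

print(np.round(reward, 2))   # the rightmost state should get the highest inferred reward
```

The inferred reward ends up highest on the state the demonstrator keeps heading toward, which is the basic inversion the quoted definition describes: behaviour in, candidate objectives out.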