The most interesting feature is deference: you can have a pathologically uncertain agent that constantly seeks human input. Because of that uncertainty, it's also careful about how it goes about seeking input. For example, if it's unsure whether humans like to be stabbed (we don't), it wouldn't stab you just to see your reaction; that would be risky! Instead, it would ask, or seek out historical evidence.
This is an important safety feature: it slows the agent down, grounds it, and helps it avoid risky value extrapolation (and therefore poor convergence).
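To make the deference mechanism concrete, here is a minimal toy sketch (my own illustration, not code from any CIRL paper): the agent holds a posterior over what the human values and compares acting immediately under that uncertainty with paying a small cost to ask the human first. All action names, probabilities, and rewards are made-up assumptions.

```python
# Toy value-of-information calculation behind deference (illustrative only).
# The agent is unsure whether the human wants a candidate action taken, so it
# weighs (a) acting now under its posterior against (b) asking the human first.

# Posterior over reward hypotheses: probability and a reward table per action.
posterior = {
    "human_likes_it": {"p": 0.7, "rewards": {"do_it": +10.0, "do_nothing": 0.0}},
    "human_hates_it": {"p": 0.3, "rewards": {"do_it": -100.0, "do_nothing": 0.0}},
}
QUERY_COST = 1.0  # small cost of interrupting the human with a question


def expected_reward(action: str) -> float:
    """Expected reward of taking `action` now, averaged over the posterior."""
    return sum(h["p"] * h["rewards"][action] for h in posterior.values())


def value_of_acting_now() -> float:
    """Best the agent can do without asking: pick the highest expected-reward action."""
    return max(expected_reward(a) for a in ("do_it", "do_nothing"))


def value_of_asking_first() -> float:
    """Ask, learn which hypothesis is true, then act optimally for that hypothesis."""
    informed = sum(h["p"] * max(h["rewards"].values()) for h in posterior.values())
    return informed - QUERY_COST


print("act now:  ", value_of_acting_now())    # 0.0  -> too risky to "do_it" blind
print("ask first:", value_of_asking_first())  # 6.0  -> the answer is worth the cost
```

Because asking dominates, the uncertain agent defers to the human rather than running the risky experiment (stabbing you to see your reaction) or simply freezing.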
It's worth noting that CIRL sometimes goes by other names.
There's some recent academic research on CIRL that gets overlooked on LessWrong, where we seem to discuss only Stuart Russell's work.
Recent work:
- 2024 work by @xuanalogue and team
- "Modeling Boundedly Rational Agents with Latent Inference Budgets" (2023)
- Brunke et al., 2022, "Safe learning in robotics: From learning-based control to safe reinforcement learning"
See also the overviews in lectures 3 and 4 of Roger Grosse's CSC2547 Alignment Course.
CIRL is also closely related to both assistance games and Recursive Reward Modelling (part of OpenAI's superalignment effort).
On the other hand, there are some older rebuttals of parts of it. For example, Eliezer's article on fully updated deference discusses the shutdown problem and whether uncertainty really helps.
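To see why this worry bites, here is the same toy calculation as in the sketch above (again purely illustrative, with made-up numbers) after the agent has finished updating: once the posterior concentrates on one hypothesis, asking no longer buys information, so the incentive to defer largely disappears.

```python
# Same toy model as the earlier sketch, but with a nearly converged posterior:
# the agent is now ~99.9% sure what the human wants.
posterior = {
    "human_likes_it": {"p": 0.999, "rewards": {"do_it": +10.0, "do_nothing": 0.0}},
    "human_hates_it": {"p": 0.001, "rewards": {"do_it": -100.0, "do_nothing": 0.0}},
}
QUERY_COST = 1.0

# Best expected reward from acting immediately, without consulting the human.
act_now = max(
    sum(h["p"] * h["rewards"][a] for h in posterior.values())
    for a in ("do_it", "do_nothing")
)
# Expected reward from asking first, now that the answer is almost a foregone conclusion.
ask_first = sum(h["p"] * max(h["rewards"].values()) for h in posterior.values()) - QUERY_COST

print(act_now, ask_first)  # ~9.89 vs ~8.99: a (nearly) fully updated agent stops deferring
```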