Yeah, this is definitely something I consider plausible. But I don’t have a strong stance, because RL mechanics could still produce an internal search process even in toy models (unless I’m just unaware of work that shows otherwise). That said, I definitely think work on slightly larger models would be pretty useful and would plausibly alleviate this, and it’s one of the things I’m planning to work on.
Yeah, IMO the “RL at scale trains search-based mesa optimizers” hypothesis predicts, with reasonable probability, “solving randomly generated mazes via a roughly unitary mesa objective and heuristic search”, and that seems like a toy domain to me.
One thing I imagine might be useful even in small training regimes would be to train on tasks where the only possible solution necessarily involves a search procedure, i.e. “search-y” tasks.
For example, it’s plausible that simple heuristics aren’t sufficient to reach a superhuman level on tasks like Chess or Go, so superhuman RL performance on these tasks would be fairly good evidence that the model already has an internal search process.
But one problem with Chess or Go would be that the objective is fixed, i.e. the game rules. So perhaps one way to effectively isolate objectives in small training regimes is to find tasks that are both “search-y” and can be modified to have modularly varying objectives, e.g. Chess but with various possible game rules.
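To make the “modularly varying objectives” idea concrete, here’s a minimal sketch (my own illustration, not an existing benchmark; the class and parameter names are hypothetical) of an environment whose reward rule is re-sampled each episode and exposed in the observation. A real experiment would want dynamics hard enough that shallow heuristics don’t suffice; this only illustrates how the objective could be varied modularly:

```python
# Hypothetical sketch: a gridworld whose reward rule changes every episode
# and is shown to the agent, so planning must condition on the goal spec
# rather than memorize a single fixed objective.
import random

class VariableGoalGrid:
    """Gridworld with a per-episode re-sampled objective."""

    ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

    def __init__(self, size=7, n_tile_types=3, seed=None):
        self.size = size
        self.n_tile_types = n_tile_types
        self.rng = random.Random(seed)

    def reset(self):
        # Random tile layout: each cell gets a tile type in {0, ..., n-1}.
        self.grid = [[self.rng.randrange(self.n_tile_types)
                      for _ in range(self.size)] for _ in range(self.size)]
        self.pos = (0, 0)
        # The modularly varying objective: which tile type is rewarded
        # this episode. It is part of the observation.
        self.goal_type = self.rng.randrange(self.n_tile_types)
        return self._obs()

    def _obs(self):
        return {"grid": [row[:] for row in self.grid],
                "pos": self.pos,
                "goal_type": self.goal_type}

    def step(self, action):
        dr, dc = self.ACTIONS[action]
        r = min(max(self.pos[0] + dr, 0), self.size - 1)
        c = min(max(self.pos[1] + dc, 0), self.size - 1)
        self.pos = (r, c)
        done = self.grid[r][c] == self.goal_type
        reward = 1.0 if done else -0.01  # small step cost to encourage planning
        return self._obs(), reward, done
```

The same interface shape would carry over to something like rule-varied Chess: the “rules” become part of each episode’s observation, so any internal objective the model forms has to be read off from the input rather than baked into the weights.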
Oh yeah, I agree. I was thinking more along the lines that small models would end up with heuristics even for some tasks that require search to do really well, because there may be somewhat complex heuristics, learnable by models of that size, that give okay performance relative to the low-power search they would otherwise be capable of. I agree that this could make a quantitative difference, though, and I hadn’t thought explicitly about structuring the task along these lines, so thanks!