Seems tangentially related to the "train a sequence of reporters" strategy for ELK. They don't phrase it in terms of basins and path dependence, but that's a great frame to look at it through.
Personally, I think supervised learning has low path-dependence, because of exact gradients plus the fact that in high dimensions there is almost always some direction along which to escape a basin. Reinforcement learning, by contrast, seems to have high path-dependence, because updates influence the future training data, which creates attractors/equilibria. (I'm more uncertain about the latter claim, but that's my intuition.)
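To illustrate the second claim, here's a toy sketch (entirely my own construction, nothing from the post): a single-parameter model trained with the exact same gradient step in two conditions, where the only difference is whether labels come from a fixed 50/50 distribution or are sampled from the model itself. The fixed-data runs all end near the same point; the self-generated runs scatter widely, Pólya-urn style, because early noise in the model's outputs gets baked into its own training signal:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train(theta, steps, lr, rng, self_generated):
    """One run of logistic-style updates on a single parameter.

    self_generated=False: labels come from a fixed 50/50 distribution (SL-like).
    self_generated=True:  labels are sampled from the model itself (RL-like
                          feedback loop: the policy shapes its own data).
    """
    for _ in range(steps):
        p = sigmoid(theta)
        if self_generated:
            y = 1.0 if rng.random() < p else 0.0   # data depends on current model
        else:
            y = float(rng.integers(0, 2))          # data is fixed, iid 50/50
        theta += lr * (y - p)                      # same log-likelihood gradient step
    return theta

for seed in range(5):
    rng = np.random.default_rng(seed)
    sl = train(theta=0.0, steps=5000, lr=0.1, rng=rng, self_generated=False)
    rl = train(theta=0.0, steps=5000, lr=0.1, rng=rng, self_generated=True)
    print(f"seed {seed}: fixed-data -> {sl:+.2f}   self-generated -> {rl:+.2f}")
```

The fixed-data condition is mean-reverting (the expected update pulls theta back toward 0), while in the self-generated condition the expected update is zero and the noise shrinks as theta drifts to an extreme, so each run freezes wherever its particular path took it.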
So the really out-there take: we want to give the LLM influence over its future training data in order to increase path-dependence and get the attractors we want ;)