Given that in the limit (infinite data and infinite parameters in the model) LLMs are world simulators with tiny simulated humans inside writing text on the internet, the pressure applied to that simulated human is not toward understanding our world, but toward understanding that simulated world and being an agent inside it. Which I think gives some hope.
Of course, real-world LLMs are far from that limit, and we have no idea which path gradient descent takes toward it. Eliezer famously argued about the whole "simulator vs. predictor" distinction, which I think is relevant to that intermediate state far from the limit.
Also, RLHF applies additional weird pressures, for example a pressure to be aware that it's an AI (or at least to pretend to be aware, whatever that might mean), which makes fine-tuned LLMs actually less safe than raw ones.