I think that an objective is a (future) world state, or a probability distribution over world states (or over a subset of the factors of a world state), relative to the world model of an intelligent system.
If the world model, cost model, and planning model are not disentangled within the AI architecture but are instead a spaghetti ball inside a giant LLM wrapped as an AutoGPT agent, then of course the objectives will hardly be tractable or interpretable (recovering them completely would require “full” interpretability of the LLM, cf. Othello-GPT).
Solution: use AI architectures with explicitly disentangled world models, such as LeCun’s H-JEPA, and then use a separate model to convert the world state (i.e., the embedding) into a textual description for human interpretability. A rough sketch of this layout is below.
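To make the proposal concrete, here is a minimal, hypothetical PyTorch-style sketch (not LeCun’s actual H-JEPA code) of what “explicitly disentangled” could look like: the world model, the cost model, and the planner are separate modules, and a separate decoder maps the latent world-state embedding to tokens so the predicted objective state can be inspected as text. All class and function names (`WorldModel`, `CostModel`, `plan`, `EmbeddingToText`) are illustrative assumptions, not an existing API.

```python
import torch
import torch.nn as nn

class WorldModel(nn.Module):
    """Predicts the next latent world state from the current state and an action."""
    def __init__(self, state_dim=64, action_dim=8):
        super().__init__()
        self.dynamics = nn.Sequential(
            nn.Linear(state_dim + action_dim, 128), nn.ReLU(),
            nn.Linear(128, state_dim),
        )

    def forward(self, state, action):
        return self.dynamics(torch.cat([state, action], dim=-1))

class CostModel(nn.Module):
    """Scores a predicted latent state; lower cost = closer to the objective."""
    def __init__(self, state_dim=64):
        super().__init__()
        self.score = nn.Linear(state_dim, 1)

    def forward(self, state):
        return self.score(state)

def plan(world_model, cost_model, state, action_dim=8, horizon=5, n_samples=256):
    """Random-shooting planner: sample action sequences, roll them through the
    world model, and pick the sequence whose predicted end state has lowest cost."""
    actions = torch.randn(n_samples, horizon, action_dim)
    states = state.expand(n_samples, -1)
    for t in range(horizon):
        states = world_model(states, actions[:, t])
    best = cost_model(states).squeeze(-1).argmin()
    return actions[best], states[best]  # best action sequence and its predicted end state

class EmbeddingToText(nn.Module):
    """Separate interpretability head: maps a latent world state to token ids,
    so the objective (a predicted world state) can be rendered as text."""
    def __init__(self, state_dim=64, vocab_size=1000, max_len=16):
        super().__init__()
        self.max_len, self.vocab_size = max_len, vocab_size
        self.decode = nn.Linear(state_dim, vocab_size * max_len)

    def forward(self, state):
        logits = self.decode(state).view(-1, self.max_len, self.vocab_size)
        return logits.argmax(dim=-1)  # token ids describing the state

if __name__ == "__main__":
    wm, cm, dec = WorldModel(), CostModel(), EmbeddingToText()
    current_state = torch.randn(1, 64)  # latent embedding of the current world state
    best_actions, predicted_goal_state = plan(wm, cm, current_state)
    # Inspect the objective as (here untrained, hence meaningless) tokens:
    print(dec(predicted_goal_state.unsqueeze(0)))
```

The point of the sketch is only the module boundaries: the objective lives in the cost model and the planner’s predicted end state, and the text decoder is a distinct component you can point interpretability tools at, rather than a property you have to excavate out of a monolithic LLM.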