I’m not very scared of any AGI that isn’t capable of being a scientist — it seems unlikely to be able to go FOOM. In order to do that, it needs to:
hold multiple disagreeing world models at the same time, and reason under uncertainty across them
do approximate Bayesian updates on their probabilities
plan conservatively under uncertainty, i.e. have broken the Optimizer’s Curse
creatively come up with new hypotheses, i.e. create new candidate world models
devise and carry out low-cost/risk experiments to distinguish between world models
I think it’s going to be hard to do all of these things well if its world models aren’t fairly modular and separable from the rest of its mental architecture.
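To make the first few of those requirements concrete, here is a toy sketch in Python. Everything in it (the candidate models, priors, payoffs, and experiment costs) is invented purely for illustration; it only shows the shape of "track several world models, update their probabilities approximately Bayesianly, score plans by their worst plausible case, and prefer cheap experiments the models disagree about":

```python
# Toy sketch of the loop described above: several candidate world models,
# approximate Bayesian updates on their probabilities, conservative scoring of
# plans, and cheap experiments chosen for how strongly the models disagree.
# The models, priors, payoffs, and costs are all invented for illustration.

# Each "world model" is reduced to a likelihood function over observations.
world_models = {
    "wm_biased_heads": lambda obs: 0.8 if obs == "heads" else 0.2,
    "wm_fair":         lambda obs: 0.5,
    "wm_biased_tails": lambda obs: 0.2 if obs == "heads" else 0.8,
}
probs = {name: 1.0 / len(world_models) for name in world_models}  # uniform prior


def bayes_update(probs, obs):
    """Approximate Bayesian update of each candidate model's probability."""
    unnorm = {name: p * world_models[name](obs) for name, p in probs.items()}
    z = sum(unnorm.values())
    return {name: w / z for name, w in unnorm.items()}


def conservative_score(plan_payoffs, probs, threshold=0.05):
    """Score a plan by its worst payoff among still-plausible models, rather
    than by its best-case estimate (a crude hedge against the Optimizer's Curse)."""
    return min(payoff for name, payoff in plan_payoffs.items() if probs[name] > threshold)


def experiment_value(predictions, probs, cost):
    """Cheap proxy for how well an experiment distinguishes the models:
    probability-weighted variance of their predictions, minus the cost."""
    mean = sum(probs[n] * predictions[n] for n in probs)
    return sum(probs[n] * (predictions[n] - mean) ** 2 for n in probs) - cost


probs = bayes_update(probs, "heads")  # observe evidence, reweight the models

# Conservative comparison of a plan whose payoff differs across models.
plan_payoffs = {"wm_biased_heads": 10.0, "wm_fair": 2.0, "wm_biased_tails": -5.0}
print("worst plausible payoff:", conservative_score(plan_payoffs, probs))

# Prefer the cheap experiment the models disagree about most.
predictions = {"wm_biased_heads": 0.8, "wm_fair": 0.5, "wm_biased_tails": 0.2}
print("experiment value:", experiment_value(predictions, probs, cost=0.1))
```

The point of the sketch is how cleanly the world models separate from the updating, planning, and experiment-selection machinery: each piece touches the models only through their likelihoods and predictions, which is exactly the modularity claim above.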
One possibility that I find plausible as a path to AGI is if we design something like a Language Model Cognitive Architecture (LMCA) along the lines of AutoGPT, and require that its world model actually be some explicit combination of human natural language, mathematical equations, and executable code that might be fairly interpretable to humans. Then the only portions of its world model that are very hard to inspect are those embedded in the LLM component.
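As a purely hypothetical sketch of what one entry in such an explicit world model could look like (the names WorldModelEntry and plan_with_llm are invented here, not taken from AutoGPT or any existing LMCA): each belief carries a natural-language claim, the corresponding equation, and an executable form, with only the planning step delegated to the LLM.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical sketch of one entry in such an explicit world model: a claim in
# natural language, the same claim as an equation, and an executable form, all
# inspectable by a human. The names here (WorldModelEntry, plan_with_llm) are
# invented for illustration, not taken from AutoGPT or any existing LMCA.

@dataclass
class WorldModelEntry:
    claim: str                      # human-readable statement of the belief
    equation: str                   # the same belief as math, for auditing
    simulate: Callable[..., float]  # executable form the architecture can run
    confidence: float               # current credence in this entry

world_model = [
    WorldModelEntry(
        claim="A dropped object near Earth's surface accelerates at ~9.8 m/s^2.",
        equation=r"d = \frac{1}{2} g t^2, \quad g \approx 9.8",
        simulate=lambda t: 0.5 * 9.8 * t ** 2,
        confidence=0.99,
    ),
]

def plan_with_llm(goal: str, entries: list) -> str:
    """Placeholder for the LMCA's planning step: the LLM reads the explicit,
    human-inspectable entries and proposes a plan; only this call is opaque."""
    context = "\n".join(f"[{e.confidence:.2f}] {e.claim}" for e in entries)
    return f"(an LLM call would go here, conditioned on {goal!r} and:\n{context})"

print(plan_with_llm("predict fall time from a 20 m ledge", world_model))
```

A human auditor can read the claim and equation directly and spot-check the executable form against them; only the call inside plan_with_llm remains hard to inspect.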
Yeah, that’s where my current thinking is at as well. I wouldn’t term it as having “multiple world models” — rather, as entertaining multiple possible candidates for the structure of some region of its world-model — but yes, I think we can say a lot about the convergent shape of world-models by reasoning from the idea that they need to be easy to adapt and recompose based on new evidence.
I’ve also had this idea, as a steelman of the whole “externalized reasoning oversight” agenda: prompt an LLM to generate a semantic world-model, with the LLM itself just playing the role of the planning process over it. However, I expect it wouldn’t work as intended, for two reasons:
Inasmuch as it’s successful, the world-model is unlikely to stay naively human-interpretable. It’d drift towards alien wordings, concepts, connections. And even if we force it to look human-interpretable, steganography is convergent, and this sort of setup opens up many more dimensions in which to sneak in messages than standard chains-of-thought. And if we manage to defeat steganography as well, I then expect a WM to be forced to look like terabytes upon terabytes of complexly interconnected text, with each plan-making query on it generating mountains of data, perhaps too much to reasonably sort out. Tying in to...
It’ll probably be too computationally intensive to work at all. Explicitly running generally-intelligent queries on our world-models already takes humans a lot of time, compared to the speed at which our instincts work. If each step of a query required a whole LLM forward-pass, instead of the minimal function required for it? I expect it’d require orders of magnitude more compute than Earth is going to have in the near term (a rough back-of-envelope sketch follows below).
And these two points aren’t independent: the more human-interpretable we’d force the WM to look, the more wasteful and impractical it’d be.
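To put rough numbers on the compute point: the sketch below uses the standard estimate that a transformer forward pass costs roughly 2 × (parameter count) FLOPs per generated token; the model size, tokens per reasoning step, and steps per query are illustrative assumptions, not measurements.

```python
# Back-of-envelope for the compute point above. The only non-invented number is
# the standard rule of thumb that a transformer forward pass costs roughly
# 2 * (parameter count) FLOPs per generated token; the model size, tokens per
# reasoning step, and steps per query are illustrative assumptions.

params = 100e9               # assume a ~100B-parameter LLM
flops_per_token = 2 * params
tokens_per_step = 200        # assumed tokens emitted per reasoning step
steps_per_query = 10_000     # assumed world-model lookups per planning query

llm_query_flops = flops_per_token * tokens_per_step * steps_per_query
minimal_step_flops = 1e4     # assume a purpose-built lookup/inference primitive
minimal_query_flops = minimal_step_flops * steps_per_query

print(f"LLM-driven query:    {llm_query_flops:.1e} FLOPs")
print(f"minimal primitives:  {minimal_query_flops:.1e} FLOPs")
print(f"overhead factor:     {llm_query_flops / minimal_query_flops:.1e}x")
```

Under those (made-up) assumptions, routing every step of a world-model query through a full forward pass costs billions of times more than a purpose-built primitive would, which is the overhead the point above is gesturing at.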
Cool! I am working on something that is fairly similar (with a bunch of additional safety considerations). I don’t go too deeply into the architecture in my article, but would be curious what you think!