All valid points. (Though people are starting to get quite good results out of agentic scaffolds for short chains of thought, so it’s not that hard, and the primary issue seems to be that existing LLMs just aren’t consistent enough in their behavior to keep it going for long.)
On your second bullet: personally I want to build a scaffolding suitable for an AGI-that-is-a-STEM-researcher in which the long-term approximate-Bayesian reasoning on theses is something like explicit and mathematical symbol manipulation and/or programmed calculation and/or tool-AI (so a blend of LLM with AIXI-like GOFAI). I think we could then safely point it at Value Learning or AI-assisted Alignment and get a system with a basin of attraction converging from partial alignment to increasingly-accurate alignment (that’s basically my current SuperAlignment plan). But then, for a sufficiently large transformer model, in-context learning is already approximately Bayesian, so we’d be duplicating an existing mechanism, like RAG duplicating long-term memory when the LLM already has in-context memory. I’m wondering if we could get an LLM sufficiently well-calibrated that we could just use its logits (on a carefully selected token) as a currency of exchange with the long-term approximate-Bayesian calculation: “I have weighed all the evidence and it has shifted my confidence in the thesis…” [now compare logits of ‘up’ vs ‘down’, or do a trained linear probe calibrated in logits, or something].
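To make the "currency of exchange" idea concrete, here is a minimal sketch of the bookkeeping side only. It assumes (and this is a big assumption, not an established result) that the gap between the ‘up’ and ‘down’ logits can be calibrated to behave like a log-likelihood ratio, and that the pieces of evidence are roughly independent, so the shifts can simply be summed in log-odds space. The function names and the `scale` calibration factor are hypothetical, and the logits would have to come from whatever model or probe you are actually using.

```python
import math
from typing import Sequence, Tuple


def to_log_odds(p: float) -> float:
    """Convert a probability in (0, 1) to log-odds."""
    return math.log(p / (1.0 - p))


def to_probability(log_odds: float) -> float:
    """Convert log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-log_odds))


def evidence_shift(logit_up: float, logit_down: float, scale: float = 1.0) -> float:
    """Treat the 'up' minus 'down' logit gap as a log-likelihood ratio.

    ASSUMPTION: the gap is calibrated; `scale` is a hypothetical
    calibration factor that would have to be fitted on held-out data.
    """
    return scale * (logit_up - logit_down)


def update_thesis(
    prior: float,
    token_logits: Sequence[Tuple[float, float]],
    scale: float = 1.0,
) -> float:
    """Bayes' rule in log-odds form: add one shift per piece of evidence.

    ASSUMPTION: the pieces of evidence are independent given the thesis.
    """
    total = to_log_odds(prior)
    for logit_up, logit_down in token_logits:
        total += evidence_shift(logit_up, logit_down, scale)
    return to_probability(total)


if __name__ == "__main__":
    # Made-up (logit_up, logit_down) pairs, one per evidence-weighing step.
    readings = [(2.1, 0.4), (0.3, 0.9), (1.5, 1.5)]
    print(update_thesis(prior=0.5, token_logits=readings))
```

The explicit part here is trivial on purpose: all of the difficulty is in whether the logit gap really is calibrated, which is the open question.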