doing a good job of simulating human System 2 thinking over significant periods it was either necessary, or at least a lot cheaper, to supply the needed cognitive abilities via scaffolding
I agree that sufficiently clever scaffolding could likely supply this. But:
I expect that figuring out what this scaffolding is, is a hard scientific challenge, such that by-default, on the current paradigm, we’ll get to AGI by atheoretic tinkering with architectures rather than by figuring out how intelligence actually works and manually implementing that. (Hint: clearly it’s not as simple as the most blatantly obvious AutoGPT setup.)
If we get there by figuring out the scaffolding, that’d actually be a step towards a more alignable AGI, in the sense of us getting some idea of how to aim its cognition. Nowhere near sufficient for alignment and robust aimability, but a step in the right direction.
All valid points. (Though people are starting to get quite good results out of agentic scaffolds, for short chains of thought, so it’s not that hard, and the primary issue seems to be that existing LLMs just aren’t consistent enough in their behavior to be able to keep it going for long.)
On your second bullet: personally I want to build scaffolding suitable for an AGI-that-is-a-STEM-researcher in which the long-term approximate-Bayesian reasoning on theses is something like explicit mathematical symbol manipulation and/or programmed calculation and/or tool-AI (so a blend of LLM with AIXI-like GOFAI) — since I think we could then safely point it at Value Learning or AI-assisted Alignment and get a system with a basin of attraction converging from partial alignment to increasingly-accurate alignment (that’s basically my current SuperAlignment plan). But then, for a sufficiently large transformer model, in-context learning is already approximately Bayesian, so we’d be duplicating an existing mechanism, much as RAG duplicates long-term memory when the LLM already has in-context memory. I’m wondering if we could get an LLM sufficiently well-calibrated that we could just use its logits (on a carefully selected token) as a currency of exchange into the long-term approximate-Bayesian calculation: “I have weighed all the evidence and it has shifted my confidence in the thesis… [now compare logits of ‘up’ vs ‘down’, or use a trained linear probe calibrated in logits, or something]”
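To make the logit-comparison idea concrete, here is a minimal sketch (in plain Python, with made-up logit values standing in for a real model's outputs) of how comparing the logits of ‘up’ vs ‘down’ could feed a running log-odds Bayesian update on a thesis. The function names and the treatment of the calibrated read-out as a likelihood ratio are my assumptions, not anything the comment specifies:

```python
import math

def confidence_from_logits(logit_up: float, logit_down: float) -> float:
    """Two-way softmax over the 'up' and 'down' tokens: the model's
    (hopefully calibrated) probability that the evidence shifted its
    confidence in the thesis upward."""
    return 1.0 / (1.0 + math.exp(logit_down - logit_up))

def bayes_update(prior: float, evidence_prob: float) -> float:
    """Fold the read-out into a long-term estimate by adding log-odds,
    i.e. treating the calibrated probability as a likelihood ratio.
    This is the 'currency of exchange' step."""
    prior_lo = math.log(prior / (1.0 - prior))
    evidence_lo = math.log(evidence_prob / (1.0 - evidence_prob))
    posterior_lo = prior_lo + evidence_lo
    return 1.0 / (1.0 + math.exp(-posterior_lo))

# Hypothetical logits from the carefully selected token position:
p_up = confidence_from_logits(2.0, -1.0)   # sigmoid(3.0), strongly 'up'
posterior = bayes_update(0.5, p_up)        # prior 0.5 contributes zero log-odds
```

With a flat prior of 0.5 the posterior just equals the evidence probability, which is one sanity check that the log-odds bookkeeping is consistent; a real version would need the logits to actually be calibrated (e.g. via a trained probe) for the arithmetic to mean anything.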