Thanks for writing up some of the theory of change for the tiling agents agenda!
I’d be curious about your take on the importance of the Löbian obstacle: I feel like this research is important for aligning full-blown RSI-to-superintelligence, but at the same time the obstacle introduces quite a bit of extra difficulty, and I’d be more excited about research (ultimately aiming for pivotal-act-level alignment) where we’re fine assuming some “fixed meta level” in the learning algorithm, as long as that level is general enough that the object-level AI can get very powerful. It seems to me that this might make it easier to prove, or at least heuristically argue, that the AI will end up with some desirable properties.
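(To spell out the obstacle as I understand it, roughly: if the agent reasons in a proof system $T$ and wants to trust a successor that also reasons in $T$, the natural trust principle is the soundness schema $\Box_T \varphi \to \varphi$, but Löb’s theorem blocks $T$ from endorsing that schema about itself:

$$\text{if } T \vdash \Box_T \varphi \to \varphi \text{ for a given } \varphi, \text{ then } T \vdash \varphi, \text{ so the schema applied to } \varphi = \bot \text{ would make } T \text{ inconsistent.}$$

So the naive “prove that my successor’s proofs can be trusted” route is closed off, and that’s the extra difficulty I have in mind.)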
Relatedly, I feel like Arbital had the categories “RSI” and “KANSI”, but AFAICT no clear third category like “unknown-algorithm non-full-self-improving (UANFSI?) AI”. (IMO current deep learning clearly fits into this third category, though there might be a lot more out there that would too.) I’m currently working on KANSI AI, but if I weren’t, I’d be a bit more excited about (formal) UANFSI approaches than about full RSI theory, especially since the latter seems to have been tried more. (E.g. I guess I’d classify Vanessa Kosoy’s work as UANFSI, though I haven’t looked at it much yet.) (Also, there can still be some self-improvement for UANFSI AIs, but as said, some meta level would remain fixed.)
But it’s possible I’m strongly misunderstanding something (e.g. maybe the Löbian obstacle isn’t that central?).
(In any case I think there ought to be multiple people continuing this line of work.)
I have lost interest in the Löbian approach to tiling, because probabilistic tiling results seem like they can be strong enough while requiring much less suspicious-looking solutions. Expected value maximization is a better way of looking at agentic behavior anyway. Trying to logically prove some safety predicate for all actions seems like a worse paradigm than trying to prove some safety properties for the system overall (including proving that those properties tile under as-reasonable-as-possible assumptions, plus sanity-checking what happens when those assumptions aren’t precisely true).
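Schematically (a rough paraphrase of the contrast, not a statement of any particular theorem): the per-action paradigm wants something like

$$\forall a:\ \vdash \mathrm{Safe}(a) \ \text{ before the agent takes action } a,$$

whereas the paradigm I prefer wants a global property $\Phi$ plus a tiling step,

$$\vdash \Phi(\mathrm{agent}), \qquad \Phi(\mathrm{agent}) \wedge \mathrm{Assumptions} \;\Rightarrow\; \Phi(\mathrm{successor}),$$

with probabilistic versions replacing outright proof by high-probability or expected-value analogues.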
I do think Löb-ish reasoning still seems potentially important for coordination and cooperation, which I expect to feature in important tiling results (if this research program continues to make progress). However, I am optimistic about replacing Löb’s Theorem with Payor’s Lemma in this context.
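For reference, the two statements side by side, in the usual provability-logic form:

$$\text{Löb:}\ \ \vdash \Box x \to x \ \implies\ \vdash x, \qquad\quad \text{Payor:}\ \ \vdash \Box(\Box x \to x) \to x \ \implies\ \vdash x.$$

The conclusion is the same, but Payor’s hypothesis is modalized ($x$ only needs to follow from the provability of $\Box x \to x$), and the proof doesn’t route through Löb’s theorem, which is part of why it looks more promising to me for the coordination setting.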
I don’t completely discount the pivotal-act approach, but I am currently more optimistic about developing safety criteria & designs which could achieve some degree of consensus amongst researchers, and make their way into commercial AI, perhaps through regulation.
Thanks!
(I don’t fully understand yet what results you’re aiming for, but yeah, it makes sense that probabilistic guarantees make some things more feasible. Not sure whether there are further relaxations I’d be fine making, at least initially.)
How would you become confident that a UANFSI approach was NFSI?
Idk, that could be part of finding heuristic arguments for desirable properties of whatever a UANFSI converges to. Possibly it’s easier to provide probabilistic convergence guarantees for systems that don’t do FSI, so that would already give some implicit evidence. But we could also just say it’s fine if FSI happens as long as we have heuristic convergence arguments (i.e. UANFSI just allows a broader class of algorithms, which might make things easier). Though I mostly don’t expect we’d get FSI alignment through this indirect path from UANFSI; rather, I’d expect we’d get an NFSI AI if we get some probabilistic convergence guarantees.
(Also, I haven’t thought about this much at all. As said, I’m trying KANSI for now.)
I think there are some deeper insights around inner optimization that you are missing that would make you more pessimistic here. “Unknown Algorithm” to me means that we don’t know how to rule out the possibility of inner agents which have opinions about recursive self-improvement. Part of it is that we can’t just think about what it “converges to” (convergence time will be too long for interesting learning systems).
Hm, interesting. I mean, I’d imagine that if we got good heuristic guarantees for a system, that would basically mean all the not-perfectly-aligned subsystems/subsearches are limited and contained enough that they can’t engage in RSI. But maybe I misunderstand your point? (Like, maybe you have specific reason to believe it would be very hard to reliably predict that a subsystem is contained enough not to engage in RSI?)
(I think inner alignment is very hard, and humans are currently not (nearly?) competent enough to figure out, within two decades, how to set up training setups that get it right. Like, to get good heuristic guarantees I think we’d need to at least figure out something sorta like the steering subsystem that tries to align the human brain, only better, since that one isn’t good enough even for smart humans, I’d say. (Though Steven Byrnes’ agenda is perhaps a UANFSI approach that might have sorta a shot, because it might open up possibilities for studying in more detail how values form in humans. Though it’s a central example of what I was imagining when I coined the term.))