~[agent foundations]
Mateusz Bagiński
After returning to OpenAI just five days after he was ousted, Mr. Altman reasserted his control and continued its push toward increasingly powerful technologies that worried some of his critics. Dr. Sutskever remained an OpenAI employee, but he never returned to work.
Was this known before now? I didn’t know he “never returned to work”, although admittedly I wasn’t tracking the issue very closely.
More or less quoting from autogenerated subtitles:
So, this remains very speculative despite years of thinking about it, and I’m not saying something like “linear logic is just the right thing”. I feel there is something deeply wrong about linear logic as well, but I’ll give my argument.
So my pitch is like this: we want to understand where the sigma-algebra comes from, or where the algebra comes from. If it is a sigma-algebra, can we justify that? And if it’s not, what is the appropriate thing that we get instead, the thing that naturally falls out? My argument that what naturally falls out is probably more like linear logic than like a sigma-algebra goes like this.
As many of you may know, I like logical induction a whole lot. I think it’s right to think of beliefs as not completely coherent; instead, we want systems that have approximately coherent beliefs. And how do you do that? A fairly generic way is with a market, because in a market any easily computed incoherence will be pumped out by a trader who recognizes that incoherence, exploits it for a money pump, and thereby gains wealth; at some point the trader has enough wealth that it is simply enforcing that notion of coherence. I think that’s a great picture, but if we imagine beliefs as having the type signature of things on a market that can be traded, then the natural algebra for beliefs is basically an algebra of derivatives.
So we have a bunch of basic goods. Some of them represent probabilities, because we anticipate that they eventually take on the value zero or one. Some of them represent more general expectations, as I’ve been arguing. The goods have values, which are expectations, and then we have ways of composing these goods together to make more goods. I don’t have a completely satisfying picture here that says “Ah! It should be linear logic!”, but if I imagine goods as these kinds of contracts, then I naturally have something like a tensor: if I have good A and good B, then I have a contract which says “I owe you one A and one B”. I also naturally have the short of a good, which is like negation: if I have good A, then somebody can short that good. And I can make this kind of argument for the other linear operators as well, as natural contract types.
If we put together a prediction market, we can force it to use classical logic if we want. That’s what logical induction does: it gives us something that approximates classical probability distributions, but only by virtue of forcing. The market maker says “I will allow any trades that enforce the coherence properties associated with classical logic”. That is: things are either true or false, and propositions are assumed to eventually converge to one or zero. We assume that there is no unresolved proposition, even though logical induction was manufactured to deal with logical uncertainty, which means there are some undecidable propositions, because we’re dealing with general propositions in mathematics. So it’s enforcing this classicality with no justification. My hope is: we should think about what falls out naturally from setting up a prediction market when the market maker isn’t obsessed with classical logic. As I said, whatever it is, it seems closer to linear logic.
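The money-pump argument can be made concrete with a toy example (my own sketch, not from the talk; the two-outcome market and the prices are made up):

```python
# Toy market quoting prices (probabilities) for a proposition and its negation.
# If the quoted prices violate the classical coherence constraint
# P(A) + P(not A) = 1, a trader can lock in a risk-free profit (a Dutch book);
# this is the mechanism by which the market gets pumped toward coherence.

def dutch_book_profit(p_a: float, p_not_a: float) -> float:
    """Guaranteed profit per unit from exploiting incoherent prices."""
    total = p_a + p_not_a
    if total < 1.0:
        # Buy one share of A and one of not-A: pay `total` now, and exactly
        # one of the two pays out 1 when the proposition resolves.
        return 1.0 - total
    if total > 1.0:
        # Sell (short) one share of each: collect `total` now, owe exactly 1.
        return total - 1.0
    return 0.0  # coherent prices: no free money

# An incoherent market: P(A) = 0.4, P(not A) = 0.45 sums to less than 1,
# so buying both sides yields a guaranteed profit of about 0.15 per unit.
print(dutch_book_profit(0.4, 0.45))
```

A trader running this strategy accumulates wealth until its trades dominate, at which point the market’s prices satisfy the classical constraint, which is exactly the “forcing” being questioned above.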
Abram also shared the paper: From Classical to Intuitionistic Probability.
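The goods-as-contracts algebra can also be sketched in code (my own illustration; the `Atom`/`Tensor`/`Short` types and the linear `value` semantics are assumptions, not anything specified in the talk):

```python
from dataclasses import dataclass

# Toy model of goods as contracts on a market. A good's market value is an
# expectation; composite contracts mirror connectives: a tensor is
# "I owe you one A and one B", a short is the negation-like operation.

@dataclass(frozen=True)
class Atom:
    name: str

@dataclass(frozen=True)
class Tensor:   # A (x) B: a bundle owing one A and one B
    left: object
    right: object

@dataclass(frozen=True)
class Short:    # the short side of a good, negation-like
    inner: object

def value(good, prices):
    """Market value of a contract, given prices (expectations) of atoms."""
    if isinstance(good, Atom):
        return prices[good.name]
    if isinstance(good, Tensor):
        return value(good.left, prices) + value(good.right, prices)
    if isinstance(good, Short):
        return -value(good.inner, prices)
    raise TypeError(good)

prices = {"A": 0.75, "B": 0.25}
bundle = Tensor(Atom("A"), Short(Atom("B")))
print(value(bundle, prices))  # 0.75 - 0.25 = 0.5
```

This only captures the pricing side; what the other linear connectives (par, with, plus) would mean as contract types is exactly the part that remains speculative.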
Let’s say: if you train a coherently goal-directed, situationally aware, somewhat-better-than-human-level model using baseline forms of self-supervised pre-training + RLHF on diverse, long-horizon, real-world tasks, my subjective probability is ~25% that this model will be performing well in training in substantial part as part of an instrumental strategy for seeking power for itself and/or other AIs later.
Have you tried extending this gut estimate to something like:
If many labs use somewhat different training procedures to train their models, but each falls under the umbrella of “coherently goal-directed, situationally aware [...]”, what is the probability that at least one of these models “will be performing well in training in substantial part as part of an instrumental strategy for seeking power for itself and/or other AIs later”?
Some Problems with Ordinal Optimization Frame
To the extent that Tegmark is concerned about exfohazards (he doesn’t seem to be very concerned AFAICT (?)), he would probably say that more powerful and yet more interpretable architectures are net positive.
I’m pretty sure I heard Alan Watts say something like that, at least in one direction (lower levels of organization → higher levels). “The conflict/disorder at the lower level of the Cosmos is required for cooperation/harmony on the higher level.”
Or maybe the Ultimate Good in the eyes of God is the epic sequence of: dead matter → RNA world → protocells → … → hairless apes throwing rocks at each other and chasing gazelles → weirdoes trying to accomplish the impossible task of raising the sanity waterline and carrying the world through the Big Filter of AI Doom → deep utopia/galaxy lit with consciousness/The Goddess of Everything Else finale.
I mostly stopped hearing about catastrophic forgetting when Really Large Language Models became The Thing, so I figured that it’s solvable by scale (likely conditional on some aspects of the training setup, idk, self-supervised predictive loss function?). Anthropic’s work on Sleeper Agents seems like a very strong piece of evidence that this is the case.
Still, if they’re right that KANs don’t have this problem at much smaller sizes than MLP-based NNs, that’s very interesting. Nevertheless, calling catastrophic forgetting a “serious problem in modern ML” seems significantly misleading.
FWIW it was obvious to me
Behavioural Safety is Insufficient
Past this point, we assume following Ajeya Cotra that a strategically aware system which performs well enough to receive perfect human-provided external feedback has probably learned a deceptive human simulating model instead of the intended goal. The later techniques have the potential to address this failure mode. (It is possible that this system would still under-perform on sufficiently superhuman behavioral evaluations)
There are (IMO) plausible threat models in which alignment is very difficult but we don’t need to encounter deceptive alignment. Consider the following scenario:
Our alignment techniques (whatever they are) scale pretty well, as far as we can measure, even up to well-beyond-human-level AGI. However, in the year (say) 2100, the tails come apart. It gradually becomes pretty clear that what we want our powerful AIs to do and what they actually do turns out not to generalize that well outside of the distribution on which we have been testing them so far. At this point, it is too late to roll them back, e.g. because the AIs have become incorrigible and/or power-seeking. The scenario may also have a more systemic character, with AI having already been so tightly integrated into the economy that there is no “undo button”.
This doesn’t assume either the sharp left turn or deceptive alignment, but I’d put it at least at level 8 in your taxonomy.
I’d put the scenario from Karl von Wendt’s novel VIRTUA into this category.
Maybe Hanson et al.’s Grabby aliens model? @Anders_Sandberg said that some N years before that (I think more or less at the time of working on Dissolving the Fermi Paradox), he “had all of the components [of the model] on the table” and it just didn’t occur to him that they can be composed in this way. (personal communication, so I may be misremembering some details). Although it’s less than 10 years, so...
Speaking of Hanson, prediction markets seem like a more central example. I don’t think the idea was [inconceivable in principle] 100 years ago.
ETA: I think Dissolving the Fermi Paradox may actually be a good example. Nothing in principle prohibited people puzzling about “the great silence” from using probability distributions instead of point estimates in the Drake equation. Maybe it was infeasible to compute this back in the 1950s/60s, but I guess it should have been doable in the 2000s, and still, the paper was published only in 2017.
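For illustration, here is a toy version of the distributions-instead-of-point-estimates move (my own sketch; the log-uniform ranges below are invented for the example and are not the priors used in the actual paper):

```python
import math
import random

# Toy Monte Carlo treatment of the Drake equation: instead of multiplying
# point estimates, sample each factor from a range spanning orders of
# magnitude and look at the whole distribution of N, the number of
# detectable civilizations. Ranges are made up, purely illustrative.

random.seed(0)

def log_uniform(lo: float, hi: float) -> float:
    """Sample log-uniformly between lo and hi."""
    return math.exp(random.uniform(math.log(lo), math.log(hi)))

RANGES = [
    (1.0, 100.0),  # R*: star formation rate (stars/year)
    (0.1, 1.0),    # f_p: fraction of stars with planets
    (0.1, 10.0),   # n_e: habitable planets per planetary system
    (1e-6, 1.0),   # f_l: probability that life arises
    (1e-3, 1.0),   # f_i: probability of intelligence
    (1e-3, 1.0),   # f_c: probability of detectable communication
    (1e2, 1e9),    # L: lifetime of a communicating civilization (years)
]

samples = [math.prod(log_uniform(lo, hi) for lo, hi in RANGES)
           for _ in range(100_000)]

mean_n = sum(samples) / len(samples)
p_alone = sum(n < 1.0 for n in samples) / len(samples)

# The headline effect: the mean of N can be enormous while the probability
# that N < 1 ("we are effectively alone") remains substantial.
print(f"mean N: {mean_n:.0f}")
print(f"P(N < 1): {p_alone:.2f}")
```

Nothing here requires modern compute in any essential way; a few thousand samples by hand-rolled methods would show the same qualitative picture, which is part of why the late publication date is surprising.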
Taboo “evil” (locally, in contexts like this one)?
If you want to use it for ECL, then it’s not clear to me why internal computational states would matter.
Why did FHI get closed down? In the end, because it did not fit in with the surrounding administrative culture. I often described Oxford like a coral reef of calcified institutions built on top of each other, a hard structure that had emerged organically and haphazardly and hence had many little nooks and crannies where colorful fish could hide and thrive. FHI was one such fish but grew too big for its hole. At that point it became either vulnerable to predators, or had to enlarge the hole, upsetting the neighbors. When an organization grows in size or influence, it needs to scale in the right way to function well internally – but it also needs to scale its relationships to the environment to match what it is.
I don’t quite get what actions are available in the heat engine example.
Is it just choosing a random bit from H or C (in which case we can’t see whether it’s 0 or 1), OR a specific bit from W (in which case we know whether it’s 0 or 1), and moving it to another pool?
Any thoughts on Symbolica? (or “categorical deep learning” more broadly?)
All current state of the art large language models such as ChatGPT, Claude, and Gemini, are based on the same core architecture. As a result, they all suffer from the same limitations.
Extant models are expensive to train, complex to deploy, difficult to validate, and infamously prone to hallucination. Symbolica is redesigning how machines learn from the ground up.
We use the powerfully expressive language of category theory to develop models capable of learning algebraic structure. This enables our models to have a robust and structured model of the world; one that is explainable and verifiable.
It’s time for machines, like humans, to think symbolically.
How likely is it that Symbolica [or sth similar] produces a commercially viable product?
How likely is it that Symbolica creates a viable alternative for the current/classical DL?
I don’t think it’s that different from the intentions behind Conjecture’s CoEms proposal. (And it looks like Symbolica have more theory and experimental results backing up their ideas.)
Symbolica don’t use the framing of AI [safety/alignment/X-risk], but many people behind the project are associated with the Topos Institute that hosted some talks from e.g. Scott Garrabrant or Andrew Critch.
What is the expected value of their research for safety/verifiability/etc?
How likely is it that whatever Symbolica produces meaningfully contributes to doom (e.g. by advancing capabilities research without at the same time sufficiently/differentially advancing interpretability/verifiability of AI systems)?
(There’s also PlantingSpace but their shtick seems to be more “use probabilistic programming and category theory to build a cool Narrow AI-ish product” whereas Symbolica want to use category theory to revolutionize deep learning.)
I’m not aware of any, but you may call it “hybrid ontologies” or “ontological interfacing”.
There is an unsolved meta-problem, but the meta-problem is an easy problem.
Yeah, that meme did reach me. But I was just assuming Ilya got back (was told to get back) to doing the usual Ilya superalignment things and decided (was told) not to stick his neck out.