I am somewhat baffled that I have never run into anybody who is actively working on developing a paradigm of AGI aimed at creating a system that is inherently transparent to its operators.
If you have a list-sorting algorithm like QuickSort, you can just look at the code and get lots of intuitions about what properties it has. An AGI would of course be much, much more complex than QuickSort, but I am pretty sure there is a program you could write down that has the same structural property of being interpretable in this way, where the algorithm also happens to define an AGI.
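To make "interpretable in this way" concrete, here is a toy sketch in Python (QuickSort itself, nothing AGI-specific): the relevant structural properties, like termination and the output being a rearrangement of the input, can be read straight off the code.

```python
def quicksort(xs):
    """Sort a list. The structural properties are visible in the code:
    the recursion is on strictly shorter lists (so it terminates), and
    the output is built only from elements of the input."""
    if len(xs) <= 1:
        return xs  # a list of length 0 or 1 is already sorted
    pivot, rest = xs[0], xs[1:]
    smaller = [x for x in rest if x < pivot]   # everything below the pivot
    larger = [x for x in rest if x >= pivot]   # everything at or above it
    return quicksort(smaller) + [pivot] + quicksort(larger)
```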
This seems especially plausible when you consider that we can build the system out of many components, and these components out of sub-components, such that at the bottom we have some pretty small sets of instructions, each doing a specific, understandable task. And if you understand such a component, you can probably also understand how it behaves inside the larger module that uses it.
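A minimal sketch of what I mean by composition, with deliberately trivial placeholder components (these are of course not claims about what an AGI's actual modules would look like): each piece is small enough to understand completely, and the behaviour of the larger module follows from the behaviour of its parts.

```python
def tokenize(text):
    """Small, fully understandable primitive: split text on whitespace."""
    return text.split()

def count_words(tokens):
    """Another small primitive: count occurrences of each token."""
    counts = {}
    for t in tokens:
        counts[t] = counts.get(t, 0) + 1
    return counts

def most_common_word(text):
    """Larger module built only from components we already understand,
    so its behaviour follows from theirs."""
    counts = count_words(tokenize(text))
    return max(counts, key=counts.get)

print(most_common_word("the cat sat on the mat"))  # -> "the"
```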
Everything interpretability tries to do, we would just get for free in this kind of paradigm. Moreover, we could design the system so that it has additional good properties. Instead of using SGD to find just some set of weights that performs well, which we then interpret after the fact, we could constrain the kinds of algorithms we design so that they are as interpretable as possible, rather than being subjected so strongly to the will of SGD and whatever algorithms it happens to find.
Maybe these people exist (if you are one, please say hello), but I have talked to probably between 20 and 40 people who would describe themselves as doing AI alignment research, and nothing like this has ever come up, even remotely.
Basically, this is my current research agenda. I'm not necessarily saying this is definitely the best thing that will save everyone and that everybody should do it, but if zero people do it, that seems pretty strange to me. So I'm wondering whether there are some standard arguments I have not come across yet for why this kind of thing is actually really stupid to do.
There are two counter-arguments to this that I'm aware of, which I don't think in themselves justify not working on it:
1. This seems like a really hard program and might just take way too long, such that we are all already dead by the time we would have built AGI in this way.
2. This paradigm comes with the inherent problem that, because the code is interpretable, it will probably be easy to see once you have a really capable algorithm that is basically an AGI. In that case, any person on the team who understands the code well enough could just take it and do some unilateral madness. So you need to find a lot of people who are aligned enough to work on this, which might be extremely difficult.
Though I'm not even sure how much of a problem point 2 is, because it seems to be a problem in any paradigm: no matter what we do, we will probably be able to build unaligned AGI before we know how to align it. Maybe it is especially pronounced in this approach. But consider how much effort, in any paradigm, is needed to bridge the gap from being able to build an unaligned AGI to being able to build an aligned one; I think that gap might be especially short in this paradigm.
I feel like what MIRI is doing doesn't quite count. At least from my limited understanding, they are trying to identify problems that are likely to come up in highly intelligent systems and solve them in advance, but not necessarily advancing <interpretable/alignable> capabilities in the way I am imagining. Though, of course, I have no idea what they are doing in the research they do not make public.
At some point you have to deal with the fact that understanding the world entails knowing lots and lots of stuff: things like "tires are usually black", or "it's gauche to wear white after Labor Day", etc.
There seem to be only two options:
1. Humans manually type in "tires are usually black" and zillions more things like that. This is very labor-intensive, if it's possible at all. Cyc is the famous example along these lines. Davidad's recent proposal is that we should try to do this.
2. A learning algorithm infers zillions of regularities in the world, like the fact that tires are usually black. That's the deep learning approach, but there are also many non-deep-learning approaches in this category. I think conventional wisdom (which I happen to share) is that this is the only category that might actually get to powerful AGI. And I don't see how it can be compatible with "creating a system that is inherently transparent to its operators", because the AGI will do different things depending on its "knowledge", i.e. the giant collection of regularities it has discovered, which are (presumably) unlabeled by default and probably a giant mess of things vaguely like "PATTERN 87462: IF BOTH PATTERN 24953 AND PATTERN 758463 ARE SIMULTANEOUSLY ACTIVE RIGHT NOW THEN IT'S MARGINALLY MORE LIKELY THAT PATTERN 217364 WILL BE ACTIVE SOON", or whatever. And then the AGI does something, and humans have their work cut out for them figuring out why.
There might be a middle way between these—I think the probabilistic programming people might describe their roadmap-to-AGI that way?—but I don’t understand those kinds of plans, or if I do, then I don’t believe them.
I think the second option still allows for powerful AGI that's more explainable than current AI, in the same way that humans can kind of explain their decisions to each other, but not very well at the level of neuroscience.
If something like natural abstractions is real, then this would get easier. I have a hard time not believing at least a weak version of this (e.g. human and AGI neuron structures could be totally different, but they'd both end up with some basic things like "the concept of 1").
On https://consensusknowledge.com, I described the idea of building a knowledge database that is understandable to both people and computers, that is, to all intelligent agents. It would be a component responsible for memory and for interactions with other agents. Using this component, agents could increase their intelligence much faster, which could lead to the emergence of collective human superintelligence, AGI, and more generally the collective superintelligence of all intelligent agents. At the same time, because the database of knowledge and information is interpretable, such intelligence would be much safer, and the thinking performed by the AI would also be much more interpretable.
Please let me know what you think about this.
I haven’t read it in detail.
The hard part of the problem is that we need a system that can build up a good world model on its own. There is too much stuff; it would take way, way too long for a human to enter everything. I also think the algorithm needs to be able to process basically arbitrary input streams, e.g. build a model of the world just from a camera feed and the input of a microphone.
And then we want to figure out how to constrain the world model such that, if we run a planning algorithm (which we also designed) on it, we know it won't kill us because of weird stuff in the world model, the way there is weird stuff in Solomonoff induction, whose hypotheses are just arbitrary programs.
Another hard part is making a world model that is general enough to represent the complexity of the real world while staying interpretable.
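To make the shape of the problem a bit more concrete, here is a rough, purely illustrative sketch of the interfaces involved; the class and method names are mine and carry no claim about how the internals would actually work.

```python
from abc import ABC, abstractmethod

class WorldModel(ABC):
    """An explicitly written-down (not SGD-found) model of the world."""

    @abstractmethod
    def update(self, observation: bytes) -> None:
        """Incorporate one chunk of an arbitrary input stream
        (e.g. a camera frame or a slice of microphone audio)."""

    @abstractmethod
    def predict(self, action) -> "WorldModel":
        """Return the model's best guess of the world after taking `action`."""

class Planner(ABC):
    """A planning algorithm we also designed and therefore understand."""

    @abstractmethod
    def plan(self, model: WorldModel, goal) -> list:
        """Return a sequence of actions; the hard part is constraining
        `model` so that plans are not driven by 'weird stuff' inside it."""
```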
If you have a database where you just enter facts about the world, like "laptop X has resolution Y", that seems nowhere near powerful enough. Your world model only seems to be complex and to talk about the real world because you use natural-language words as descriptors. To a human brain these things have meaning, but not to a computer by default. That is how you can get a false sense of how good your world model is.
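A toy illustration of that failure mode, using made-up fact triples: to the program, the human-readable strings are opaque tokens, so renaming them all changes nothing about what the system can do, which shows the "meaning" was living in the reader's head.

```python
facts = [("laptop_X", "has_resolution", "1920x1080"),
         ("tire", "has_color", "black")]

def query(subject, relation):
    """Look up stored facts; no understanding involved, just string matching."""
    return [obj for (s, r, obj) in facts if s == subject and r == relation]

print(query("tire", "has_color"))  # ['black'], meaningful only to the human reader

# Replace every word with an opaque ID and the program behaves identically:
renamed_facts = [("e1", "r1", "e2"), ("e3", "r2", "e4")]
```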
Is that also true for a large group of people? If yes, then why?
Cyc does not work. At least not yet. I haven't really looked into it a lot, but I expect it will also not work in the near future for anything like doing a pivotal act, and they have put a lot of man-hours into it. In principle, it could probably succeed with enough data input, but it is not practical. Also, it would not succeed if you don't have the right inference algorithms, and I guess that would be hard to notice when you are distracted entering all the data, because you can just never stop entering data; there is so much of it to enter.
> Cyc does not work.
What if the group of users adding knowledge were significantly larger than the Cyc team?
Edit: I ask because Cyc is built by a group of its employees; it is not crowdsourced. Crowdsourcing often involves a much larger group of people, as in Wikipedia.
> In principle, it could probably succeed with enough data input, but it is not practical.
Why is it not practical?
> that would be hard to notice
What do you mean by “to notice” here?
Cyc does not seem like the kind of thing I would expect to work very well compared to a system that can build its world model from scratch, because even if it were crowdsourced it would take too much effort.
I mean noticing that the inference algorithms are too weak to make the system capable enough. You can still increase the capability of the system very slowly by just adding more data. So it seems easy to focus on adding more data instead of fixing the inference, which is the wrong move in that situation.
Tamsin Leake’s project might match what you’re looking for.
I feel like the thing I'm hinting at is not directly related to QACI. I'm talking about a specific way to construct an AGI where we write down all of the algorithms explicitly, whereas the QACI part of QACI is about specifying an objective that is aligned when optimized very hard. In the thing I'm describing, you would get the alignment properties from a different place: you get them because you understand, very well, the algorithm of intelligence that you have written down. Whereas in QACI, you get the alignment properties by successfully pointing to the causal process, namely the human in the world, that you want to "simulate" in order to determine the "actual objective".
Just to clarify: when I say "non-DUMB way", I mainly mean that using giant neural networks and just making them more capable in order to get to intelligent systems is the DUMB way. And Tamsin's thing seems to be one of the least DUMB things I have heard recently. I can't see how it obviously fails (yet), though of course this doesn't necessarily imply that it will succeed (though that is of course possible).
I am also interested in interpretable ML. I am developing artificial semiosis, a human-like AI training process which can achieve aligned (transparency-based, interpretability-based) cognition. You can find an example of the algorithms I am making here: the AI runs a non-deep-learning algorithm, does some reflection and forms a meaning for someone “saying” something, a meaning different from the usual meaning for humans, but perfectly interpretable.
I therefore support the case for differential technological development:
Regarding 1: it may take several years for interpretable ML to reach capabilities equivalent to LLMs, but the future may offer surprises, either in the form of coordination to pause the development of "opaque" advanced AI, or of deep learning hitting a wall… at killing everyone. Let's also have a plan for the case where we are still alive.
Regarding 2: interpretable ML would need programmed control mechanisms to be aligned. There is currently no such field of AI safety, since we do not yet have interpretable ML, but I imagine computer engineers being able to make progress on these control mechanisms (more progress than on mechanistic interpretability of LLMs). While it is true that control mechanisms can be disabled, you can always advocate for the highest security (as in Ian Hogarth's Island idea). You can then also reject this counter-argument.
mishka noted that this paradigm of AI is more foomable. Self-modification is a huge problem. I have an intuition that interpretable ML will exhibit a form of scaffolding, in that control mechanisms for robustness (i.e. for achieving capabilities) can advantageously double as alignment mechanisms. Thanks to interpretable ML, engineers may be able to study self-modification already in systems with limited capabilities and learn the right constraints.