Computing scientist and Systems architect. Currently doing self-funded AI/AGI safety research. I participate in AI standardization under the company name Holtman Systems Research: https://holtmansystemsresearch.nl/
Koen.Holtman
My quick take here is that your list of topics is not an introduction to AI safety; it is an introduction to AI safety as seen from inside the MIRI/Yudkowsky bubble, where everything is hard and nobody is making any progress. Some more diversity in viewpoints would be better.
For your audience, my go-to approach would be to cover bits of Christian’s The Alignment Problem.
At t+7 years, I’ve still seen no explicit argument for robust AI collusion, yet tacit belief in this idea continues to channel attention away from a potential solution-space for AI safety problems, leaving something very much like a void.
I agree with you that this part of the AGI x-risk solution space, the part where one tries to design measures to lower the probability of collusion between AGIs, is very under-explored. However, I do not believe that the root cause of this lack of attention is a widely held ‘tacit belief’ that robust AGI collusion is inevitable.
It is easy to imagine the existence of a very intelligent person who nevertheless hates colluding with other people. It is easy to imagine the existence of an AI which approximately maximises a reward function which has a term in it that penalises collusion. So why is nobody working on creating or improving such an AI or penalty term?
My current x-risk community model is that the forces that channel people away from under-explored parts of the solution space have nothing to do with tacit assumptions about impossibilities. These forces operate at a much more pre-rational level of human psychology. Specifically: if there is no critical mass of people working in some part of the solution space already, then human social instincts will push most people away from starting to work there, because working there will necessarily be a very lonely affair. On a more rational level, the critical mass consideration is that if you want to do work that gets you engagement on your blog post, or citations on your academic paper, the best strategy is to pick a part of the solution space that already has some people working in it.
TL;DR: if you want to encourage people to explore an under-visited part of the solution space, you are not primarily fighting against a tacit belief that this part of the space will be empty of solutions. Instead, you will need to win the fight against the belief that people will be lonely when they go into that part of the space.
Like Charlie said, there is a demonstration in AI Safety Gridworlds. I also cover these dynamics in a more general and game-theoretical sense in my paper AGI Agent Safety by Iteratively Improving the Utility Function: that paper also has running code behind it, and it formalises the setup as a two-player/two-agent game.
In general though, if people do not buy the “You can’t fetch the coffee if you’re dead” problem as a thought experiment, then I am not sure if any running-code-based demo can change their mind.
I have been constructing a set of thought experiments, illustrated with grid worlds, that do not just demo the off-switch problem, but that also demo a solution to it. The whole setup intends to clarify what is really going on here, in a way that makes intuitive sense to a non-mathematical audience. I have not published these thought experiments in writing yet, only gave a talk about them. In theory, somebody could convert the grid world pictures in this talk into running code. If you want to learn more, please contact me; I can walk you through my talk slide deck.
I think I disagree with Charlie’s hot take, because Charlie seems to be assuming that the essence of the solution to “You can’t fetch the coffee if you’re dead” must be too complicated to show in a grid world. In fact, the class of solutions I prefer can be shown very easily in a grid world. Or at least easily in retrospect.
It depends. But yes, incorrect epistemics can make an AGI safer, if it is the right and carefully calibrated kind of incorrect. A goal-directed AGI that incorrectly believes that its off switch does not work will be less resistant to people using that switch. So the goal here is to design an AGI epistemics that is the right kind of incorrect.
Note: designing an AGI epistemics that is the right kind of incorrect seems to go against a lot of the principles that aspiring rationalists seem to hold dear, but I am not an aspiring rationalist. For more technical info on such designs, you can look up my sequence on counterfactual planning.
Yes, a lot of it has been informed by economics. Some authors emphasize the relation, others de-emphasize it.
The relation goes beyond alignment and safety research. The way in which modern ML research defines its metric of AI agent intelligence is directly based on utility theory, which was developed by von Neumann and Morgenstern to describe games and economic behaviour.
Both explainable AI and interpretable AI are terms that are used with different meanings in different contexts. What exactly they mean really depends on the researcher using them.
Decision theory is a term used in mathematical statistics and philosophy. In applied AI terms, a decision theory is the algorithm used by an AI agent to compute what action to take next. The nature of this algorithm is obviously relevant to alignment. That being said, philosophers like to argue among themselves about different decision theories and how they relate to certain paradoxes and limit cases, and they conduct these arguments using a terminology that is entirely disconnected from that used in most theoretical and applied AI research. Not all AI alignment researchers believe that these philosophical arguments are very relevant to moving AI alignment research forward.
It is definitely advisable to build a paper-clip maximiser that also needs to respect a whole bunch of additional stipulations about not harming people. The worry among many alignment researchers is that it might be very difficult to make these stipulations robust enough to deliver the level of safety we ideally want, especially in the case of AGIs that might get hugely intelligent or hugely powerful. As we are talking about not-yet-invented AGI technology, nobody really knows how easy or hard it will be to build robust-enough stipulations into it. It might be very easy in the end, but maybe not. Different researchers have different levels of optimism, but in the end nobody knows, and the conclusion remains the same no matter what the level of optimism is. The conclusion is to warn people about the risk, and to do more alignment research with the aim of making it easier to build robust-enough stipulations into potential future AGIs.
When one uses mathematics to clarify many AI alignment solutions, or even just to clarify Monte Carlo tree search as a decision-making process, the mathematical structures one finds can often best be interpreted as mathematical counterfactuals, in the Pearl causal model sense. This explains the interest in counterfactual machine reasoning among many technical alignment researchers.
To explain this without using mathematics: say that we want to command a very powerful AGI agent to go about its duties while acting as if it cannot successfully bribe or threaten any human being. To find the best policy which respects this ‘while acting as if’ part of the command, the AGI will have to use counterfactual machine reasoning.
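To make this a bit more concrete in code: below is a minimal sketch of this kind of counterfactual planning, under assumptions I am adding purely for illustration (the toy ‘bribe’ action, the function names, and the brute-force search are all hypothetical, not taken from any published implementation). The agent acts in the real world, but it searches for its plan inside a modified world model in which bribe actions have no effect.

```python
# Minimal illustrative sketch: plan inside a counterfactual world model
# in which the 'bribe' action is assumed to change nothing.
from itertools import product

def make_counterfactual(real_step):
    """Wrap a real transition function so that 'bribe' acts like 'noop'."""
    def counterfactual_step(state, action):
        if action == "bribe":
            return real_step(state, "noop")  # in this model, bribing has no effect
        return real_step(state, action)
    return counterfactual_step

def best_plan(step, reward, start_state, actions, horizon):
    """Brute-force search over all action sequences of the given horizon."""
    best_value, best = float("-inf"), None
    for candidate in product(actions, repeat=horizon):
        state, value = start_state, 0.0
        for action in candidate:
            state = step(state, action)
            value += reward(state)
        if value > best_value:
            best_value, best = value, candidate
    return best

# Hypothetical planning call: because the search uses the counterfactual
# model, plans that rely on successful bribery never look attractive.
# plan = best_plan(make_counterfactual(real_step), reward, s0,
#                  ["noop", "work", "bribe"], horizon=3)
```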
I continue to be surprised that people think a misaligned consequentialist intentionally trying to deceive human operators (as a power-seeking instrumental goal specifically) is the most probable failure mode.
Me too, but note how the analysis leading to the conclusion above is very open about first excluding from consideration a huge number of failure modes that lead to x-risk:
[...] our focus here was on the most popular writings on threat models in which the main source of risk is technical, rather than through poor decisions made by humans in how to use AI.
In this context, I of course have to observe that the human decision to deploy an AGI agent which uses purely consequentialist planning towards maximising a simple metric would be a very poor decision indeed. But there are plenty of other poor decisions we need to worry about too.
To minimize P(misalignment x-risk | AGI) we should work on technical solutions to societal-AGI alignment, which is where As internalize a distilled and routinely updated constellation of shared values as determined by deliberative democratic processes driven entirely by humans
I agree that this kind of work is massively overlooked by this community. I have done some investigation into the root causes of why it is overlooked. The TL;DR is that this work is less technically interesting, and that many technical people here (and in industry and academia) would like to avoid even thinking about any work that needs to triangulate between different stakeholders who might then get mad at them. For a longer version of this analysis, see my paper Demanding and Designing Aligned Cognitive Architectures, where I also make some specific recommendations.
My overall feeling is that the growth in the type of technical risk reduction research you are calling for will have to be driven mostly by ‘demand pull’ from society, by laws and regulators that ban certain unaligned uses of AI.
But it seems like roughly the entire AI existential safety community is very excited about mechanistic interpretability and entirely dismissive of Stuart Russell’s approach, and this seems bizarre.
Data point: I consider myself to be part of the AI x-risk community, but like you I am not very excited about mechanistic interpretability research in an x-risk context. I think there is somewhat of a filter bubble effect going on, where people who are more excited about interpretability post more on this forum.
Stuart Russell’s approach is a broad agenda, and I am not on board with all parts of it, but I definitely read his provable safety slogan as a call for more attention to the design approach where certain AI properties (like safety and interpretability properties) are robustly created by construction.
There is an analogy with computer programming here: a deep neural net is like a computer program written by an amateur without any domain knowledge, one that was carefully tweaked to pass all tests in the test suite. Interpreting such a program might be very difficult. (There is also the small matter that the program might fail spectacularly when given inputs not present in the test suite.) The best way to create an actually interpretable program is to build it from the ground up with interpretability in mind.
What is notable here is that the CS/software engineering people who deal with provable safety properties have long ago rejected the idea that provable safety should be about proving safe an already-existing bunch of spaghetti code that has passed a test suite. The problem of interpreting or reverse engineering such code is not considered a very interesting or urgent one in CS. But solving this problem seems to be exactly what a section of the ML community has now embarked on. As an intellectual quest, it is interesting. As a safety engineering approach for high-risk system components, I feel it has very limited potential.
First some background on me, then some thoughts.
I am an alignment researcher and I read LW and AF occasionally. I tend to focus more on reading academic papers, not the alignment blogosphere. I read LW and AF mostly to find links to academic papers I might otherwise overlook, and for the occasional long-form analysis blogpost that the writer(s) put several months into writing. I am not a rationalist.
What I am seeing on LW is that, numerically, many of the AI posts are from newcomers to the alignment field, or from people who are just thinking about getting in. This is perfectly fine, because they need some place to post and potentially get their questions answered. I do not think that the cause of alignment would be improved by moving all of these AI newcomer posts out of LW and onto AF.
So if there is a concern that high-quality long-form rationalist content is being drowned out by all the AI talk, I suggest you create an AF-like sub-forum dedicated to rationalist thought.
The AF versions of posts are primarily meant to be a thing you can link to professionally without having to explain the context of a lot of weird, not-obviously-related topics that show up on LessWrong.
From where I am standing, professionally speaking, AF has plenty of way-too-weird AI alignment content on it. Any policy maker or card-carrying AI/ML researcher browsing AF will quickly conclude that it is a place where posters can venture far outside of their political or mainstream-science Overton windows, without ever being shouted down or even frowned upon by the rest of the posters. Also, the most up-voted and commented-on posts are often the ones least inside any Overton window. This is just a thing that has grown historically; there is definitely beauty and value in it, and it is definitely too late to change now. Too late also given that EY has now gone full prophet-of-doom.
What I am hearing is that some alignment newcomers, who have spent a few months doing original research and writing a paper on it, have trouble getting the post on their results promoted from LW to AF. This is a de-motivator which I feel limits the growth of the field. So I would not mind if the moderators of this site started using (and advertising that they are using) an automatic rule where, if it is clear that a post publishes alignment research results that took months of honest effort to produce, any author request to promote it to AF is almost automatically granted, no matter what the moderators think about the quality of the work inside.
You are welcome. Another answer to your question just occurred to me.
If you count AI fairness research as a sub-type of AI alignment research, then you can find a whole community of alignment researchers who talk quite a lot with each other about ‘aligned with whom’ in quite sophisticated ways. Reference: the main conference of this community is ACM FAccT.
In EA and on this forum, when people count the number of alignment researchers, they usually count dedicated x-risk alignment researchers only, and not the people working on fairness, or on the problem of making self-driving cars safer. There is a somewhat unexamined assumption in the AI x-risk community that fairness and self-driving car safety techniques are not very relevant to managing AI x-risk, both in the technical space and the policy space. The way my own technical x-risk work is going, I am increasingly finding that this unexamined assumption is entirely wrong.
On a lighter note:
ignoring those values means we won’t actually achieve ‘alignment’ even when we think we have.
Well, as long as the ‘we’ you are talking about here is a group of people that still includes Eliezer Yudkowsky, then I can guarantee that ‘we’ are in no danger of ever collectively believing that we have achieved alignment.
When AI alignment researchers talk about ‘alignment’, they often seem to have a mental model where either (1) there’s a single relevant human user whose latent preferences the AI system should become aligned with (e.g. a self-driving car with a single passenger); or (2) there’s all 7.8 billion humans that the AI system should be aligned with, so it doesn’t impose global catastrophic risks.
[...]
So, I’m left wondering what AI safety researchers are really talking about when they talk about ‘alignment’.
The simple answer here is that many technical AI safety researchers on this forum talk exclusively about (1) and (2) so that they can avoid confronting all of the difficult socio-political issues you mention. Many of them avoid it specifically because they believe they would not be very good at politics anyway.
This is of course a shame, because the cases between (1) and (2) have a level of complexity that also needs to be investigated. I am a technical AI safety researcher who is increasingly moving into the space between (1) and (2), in part also because I consider (1) and (2) to be more solved than many other AI safety researchers on this forum like to believe.
This then has me talking about alignment with locally applicable social contracts, and about the technology of how such social contracts can be encoded into an AI. See for example the intro post and paper here.
this is something you would use on top of a model trained and monitored by engineers with domain knowledge.
OK, that is a good way to frame it.
I guess I should make another general remark here.
Yes, using implicit knowledge in your solution would be considered cheating, and bad form, when passing AI system benchmarks which intend to test more generic capabilities.
However, if I were to buy an alignment solution from a startup, then I would prefer to be told that the solution encodes a lot of relevant implicit knowledge about the problem domain. Incorporating such knowledge would no longer be cheating, it would be an expected part of safety engineering.
This seeming contradiction is of course one of these things that makes AI safety engineering so interesting as a field.
Interesting. Some high-level thoughts:
When reading your definition of concept extrapolation as it appears here:
Concept extrapolation is the skill of taking a concept, a feature, or a goal that is defined in a narrow training situation… and extrapolating it safely to a more general situation.
this reads to me like the problem of Robustness to Distributional Change from Concrete Problems. This problem is also often known as out-of-distribution robustness, but note that Concrete Problems also considers solutions like the AI detecting that it is outside of its training distribution and then asking for supervisory input. I think you are also considering such approaches within the broader scope of your work.
To me, the above benchmark does not smell like being about out-of-distribution problems anymore, it reminds me more of the problem of unsupervised learning, specifically the problem of clustering unlabelled data into distinct groups.
One (general but naive) way to compute the two desired classifiers would be to first take the unlabelled dataset and use unsupervised learning to classify it into 4 distinct clusters. Then, use the labelled data to single out the two clusters that also appear in the labelled dataset, or at least the two clusters that appear most often in it. Then, construct the two classifiers as follows. Say that the two clusters also present in the labelled data are cluster A, whose members mostly have the label happy, and cluster B, whose members mostly have the label sad. Call the remaining clusters C and D. Then the two classifiers are (A and C = happy, B and D = sad) and (A and D = happy, B and C = sad). Note that this approach will not likely win any benchmark contest, as the initial clustering step fails to use some information that is available in the labelled dataset. I mention it mostly because it highlights a certain viewpoint on the problem.
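For concreteness, here is a minimal sketch of this naive method, assuming scikit-learn KMeans for the clustering step. The variable names, the string labels, and the hard-coded 4-cluster assumption are purely illustrative, not taken from the benchmark itself.

```python
# Minimal illustrative sketch of the naive clustering-based method above.
import numpy as np
from sklearn.cluster import KMeans

def build_two_classifiers(X_unlabelled, X_labelled, y_labelled):
    # Step 1: unsupervised clustering of the unlabelled data into 4 groups.
    km = KMeans(n_clusters=4, n_init=10).fit(X_unlabelled)

    # Step 2: use the labelled data to single out cluster A (mostly 'happy')
    # and cluster B (mostly 'sad'); the remaining clusters are C and D.
    # (Assumes the two labelled groups land in two different clusters.)
    cluster_of = km.predict(X_labelled)
    happy = np.array([np.sum((cluster_of == k) & (y_labelled == "happy")) for k in range(4)])
    sad = np.array([np.sum((cluster_of == k) & (y_labelled == "sad")) for k in range(4)])
    A, B = int(np.argmax(happy)), int(np.argmax(sad))
    C, D = (k for k in range(4) if k not in (A, B))

    # Step 3: the two classifiers agree on A and B but split C and D differently.
    clf_1 = lambda X: np.where(np.isin(km.predict(X), [A, C]), "happy", "sad")
    clf_2 = lambda X: np.where(np.isin(km.predict(X), [A, D]), "happy", "sad")
    return clf_1, clf_2
```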
For better benchmark results, you need a more specialised clustering algorithm (this type is usually called semi-supervised clustering, I believe) that can exploit the fact that the labelled dataset gives you some prior information on the shapes of two of the clusters you want.
One might also argue that, if the above general unsupervised-clustering-based method does not give good benchmark results, then this is a sign that, to be prepared for every possible model split, you will need more than just two classifiers.
Not sure what makes you think ‘strawmen’ at 2, but I can try to unpack this more for you.
Many warnings about unaligned AI start with the observation that it is a very bad idea to put some naively constructed reward function, like ‘maximize paper clip production’, into a sufficiently powerful AI. Nowadays on this forum, this is often called the ‘outer alignment’ problem. If you are truly worried about this problem and its impact on human survival, then it follows that you should be interested in doing the Hard Thing of helping people all over the world write less naively constructed reward functions to put into their future AIs.
John writes:
Far and away the most common failure mode among self-identifying alignment researchers is to look for Clever Ways To Avoid Doing Hard Things. [...] The most common pattern along these lines is to propose outsourcing the Hard Parts to some future AI [...]
This pattern of outsourcing the Hard Part to the AI is definitely on display when it comes to 2 above. Academic AI/ML research also tends to ignore this Hard Part entirely, and implicitly outsources it to applied AI researchers, or even to the end users.
Wait: fixing a utility function and then argmaxing over all possible plans is not an alignment design pattern; it is the bog-standard operational definition of what an optimal-policy MDP agent should do. This is what Stuart Russell calls the ‘standard model’ of AI. It is an agent design pattern, not an alignment design pattern. To be an alignment design pattern in my book, you have to be adding something extra, or doing something different, that is not yet in the bog-standard agent design.
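For concreteness, this ‘standard model’ is just the familiar textbook objective of picking the policy that maximises expected discounted utility under a fixed reward (utility) function R; the notation below is standard MDP notation, not notation taken from your post:

$$\pi^* \;=\; \arg\max_{\pi}\; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, R(s_t, a_t)\right]$$

An alignment design pattern, in my book, has to change something in or around this objective, rather than just restate it.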
I think you are showing that an actor-grader is just a utility maximiser in a fancy linguistic dress. Again, not an alignment design pattern in my book.
Though your use of the word ‘doomed’ sounds too absolute to me, I agree with the main technical points in your analysis. But I would feel better if you changed the terminology from alignment design pattern to agent design pattern.