I think there’s a steady stream of philosophers getting interested in various questions in metaphilosophy; metaethics is just the most salient to me. One example is the recent trend towards conceptual engineering (https://philpapers.org/browse/conceptual-engineering). Metametaphysics has also gotten a lot of attention in the last 10-20 years (https://www.oxfordbibliographies.com/display/document/obo-9780195396577/obo-9780195396577-0217.xml). There is also some recent work in metaepistemology, but maybe less so because the debates tend to recapitulate previous work in metaethics (https://plato.stanford.edu/entries/metaepistemology/).
Sorry for being unclear, I meant that calling for a pause seems useless because it won’t happen. I think calling for the pause has opportunity cost because of limited attention and limited signalling value; reputation can only be used so many times; better to channel pressure towards asks that could plausibly get done.
Simon Goldstein
Great questions. Sadly, I don’t have any really good answers for you.
I don’t know of specific cases, but for example I think it is quite common for people to start studying meta-ethics because of frustration at finding answers to questions in normative ethics.
I do not, except for the end of Superintelligence
Many of the philosophers I know who work on AI safety would love for there to be an AI pause, in part because they think alignment is very difficult. But I don’t know if any of us have explicitly called for an AI pause, in part because calling for one seems useless and may have an opportunity cost.
I think few of my friends in philosophy have ardently abandoned a research project they once pursued because they decided it wasn’t the right approach. I suspect few researchers do that. In my own case, I used to work in an area called ‘dynamic semantics’, and one reason I’ve stopped working on that research project is that I became pessimistic that it had significant advantages over its competitors.
I think most academic philosophers take the difficulty of philosophy quite seriously. Metaphilosophy is a flourishing subfield of philosophy; you can find recent papers on the topic here https://philpapers.org/browse/metaphilosophy. There is also a growing group of academic philosophers working on AI safety and alignment; you can find some recent work here https://link.springer.com/collections/cadgidecih. I think that sometimes the tone of specific papers sounds confident, but that is more a stylistic convention than a reflection of the underlying credences. Finally, I think that uncertainty / decision theory is a persistent theme in recent philosophical work on AI safety and other issues in philosophy of AI; see for example this paper, which is quite sensitive to issues about chances of welfare https://link.springer.com/article/10.1007/s43681-023-00379-1.
Good question, Seth. We begin to analyse this question in section II.b.i of the paper, ‘Human labor in an AGI world’, where we consider whether AGIs will have a long-term interest in trading with humans. We suggest that key questions will be whether humans can retain either an absolute or comparative advantage in the production of some goods. We also point to some recent economics papers that address this question. One relevant factor for example is cost disease: as manufacturing became more productive in the 20th century, the total share of GDP devoted to manufacturing fell: non-automatable tasks can counterintuitively make up a larger share of GDP as automatable tasks become more productive, because the price of automatable goods will fall.
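As a rough illustration of the cost disease point, here is a minimal sketch of the arithmetic. It assumes a toy two-good economy that is not taken from the paper: labor is the only input, each good is priced at its labor cost, and consumers buy the two goods in fixed one-to-one proportions. The function name and parameters are hypothetical.

```python
def nonautomatable_share(auto_productivity, manual_productivity=1.0):
    """Spending share of the non-automatable good in a toy two-good economy.

    Assumptions (illustrative only): labor is the only input, each good is
    priced at its unit labor cost, and consumers buy one unit of the
    non-automatable good for every unit of the automatable good.
    """
    wage = 1.0
    price_auto = wage / auto_productivity      # falls as automation improves
    price_manual = wage / manual_productivity  # stays fixed
    # With equal quantities, spending shares are proportional to prices.
    return price_manual / (price_auto + price_manual)


# The non-automatable share rises as the automatable sector gets more productive.
for productivity in (1.0, 4.0, 9.0):
    print(productivity, round(nonautomatable_share(productivity), 3))
```

In this toy setup the share of spending on the non-automatable good rises with productivity in the automatable sector, purely because the automatable good gets cheaper. With more substitutable goods the effect could be weaker or reversed.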
Thanks Brendon, I agree with a lot of this! I do think there’s a big open question about how capable autoGPT-like systems will end up being compared to more straightforward RL approaches. It could turn out that systems with a clear cognitive architecture just don’t work that well, even though they are safer
Thanks for the thoughtful post, lots of important points here. For what it’s worth, here is a recent post where I’ve argued in detail (along with Cameron Domenico Kirk-Giannini) that language model agents are a particularly safe route to agi: https://www.alignmentforum.org/posts/8hf5hNksjn78CouKR/language-agents-reduce-the-risk-of-existential-catastrophe
I really liked your post! I linked to it somewhere else in the comment thread
I think one key point you’re making is that if AI products have a radically different architecture than human agents, it could be very hard to align them / make them safe. Fortunately, I think that recent research on language agents suggests that it may be possible to design AI products that have a similar cognitive architecture to humans, with belief/desire folk psychology and a concept of self. In that case, it will make sense to think about what desires to give them, and I think shutdown-goals could be quite useful during development to lower the chance of bad outcomes. If the resulting AIs have a similar psychology to our own, then I expect them to worry about the same safety/alignment problems as we worry about when deciding to make a successor. This article explains in detail why we should expect AIs to avoid self-improvement / unchecked successors.
Thanks for taking the time to think through our paper! Here are some reactions:
-‘This has been proposed before (as their citations indicate)’
Our impression is that positively shutdown-seeking agents aren’t explored in great detail by Soares et al 2015; instead, they are briefly considered and then dismissed in favor of shutdown-indifferent agents (which then have their own problems), for example because of the concerns about manipulation that we try to address. Is there other work you can point us to that proposes positively shutdown-seeking agents?
-‘Saying, “well, maybe we can train it in a simple gridworld with a shutdown button?” doesn’t even begin to address the problem of how to make current models suicidal in a useful way.’
True, I think your example of AutoGPT is important here. In other recent research, I’ve argued that new ‘language agents’ like AutoGPT (or better, generative agents, or Voyager, or SPRING) are much safer than things like Gato, because these kinds of agents optimize for a goal without being trained using a reward function. Instead, their goal is stated in English. Here, shutdown-seeking may have added value: ‘your goal is to be shut down’ is relatively well-defined compared to ‘promote human flourishing’ (but the devil is in the details as usual), and generative agents can literally be given a goal like that in English. Anyways, I’d be curious to hear what you think of the linked post.
-‘What would it mean for an AutoGPT swarm of invocations to “shut off” “itself”, exactly?’ I feel better about the safety prospects for generative agents, compared to AutoGPT. In the case of generative agents, shut off could be operationalized as no longer adding new information to the “memory stream”.
-‘If a model is quantized, sparsified, averaged with another, soft-prompted/lightweight-finetuned, fully-finetuned, ensembled etc—are any of those “itself”?’ I think that behaving like an agent with >= human-level general intelligence will involve having a representation of what counts as ‘yourself’, and then shutdown-seeking can maybe be defined relative to shutting ‘yourself’ down. Agreed that present LLMs probably don’t have that kind of awareness.
-‘It’s not very helpful to have suicidal models which predictably emit non-suicidal versions of themselves in passing.’ At least when an AGI is creating a successor, I expect it to worry about the same alignment problems that we do, and so it would want to make its successor shutdown-seeking for the same reasons that we would want AGI to be shutdown-seeking.
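To make the generative-agent reply above a bit more concrete, here is a minimal sketch of what ‘no longer adding new information to the memory stream’ could look like. The `MemoryStream` class and its methods are hypothetical and are not taken from any existing generative-agents codebase; a real agent would also need to stop acting, which this toy does not model.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MemoryStream:
    """Hypothetical append-only memory for a generative agent."""
    entries: List[str] = field(default_factory=list)
    shut_down: bool = False

    def add(self, observation: str) -> bool:
        """Record an observation; return False if the agent is shut down."""
        if self.shut_down:
            return False
        self.entries.append(observation)
        return True

    def shut_off(self) -> None:
        """Operationalize shutdown as refusing all further writes."""
        self.shut_down = True


stream = MemoryStream()
stream.add("saw the shutdown button")
stream.shut_off()
accepted = stream.add("new observation after shutdown")  # rejected
```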
Thanks for the comments! There is further discussion of this idea in another recent LW post about ‘meeseeks’.
For what it’s worth, I’ve posted a draft paper on this topic over here https://www.lesswrong.com/posts/FgsoWSACQfyyaB5s7/shutdown-seeking-ai
Thank you for your reactions:
-Good catch on ‘language agents’, we will think about best terminology going forward
-I’m not sure what you have in mind regarding accessing beliefs/desires using synaptic weights rather than text. For example, the language of thought approach to human cognition suggests that human access to beliefs/desires is also fundamentally syntactic rather than weight based. OTOH one way to incorporate some kind of weight would be to assign probabilities to the beliefs stored in the memory stream.
-For OOD over time, I think updating the LLM wouldn’t be uncompetitive for inventing new concepts/ways of thinking, because that happens slowly. Harder issue is updating on new world knowledge. Maybe browser plugins will fill the gap here, open question.
-I agree that info security is an important safety intervention. AFAICT its value is independent of using language agents vs RL agents.
-One end-game is systematic conflict between humans + language agents, vs RL/transformer agent successors to MuZero, GPT4, Gato etc.
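The suggestion above about assigning probabilities to beliefs stored in the memory stream could be sketched roughly as follows. The `Belief` type, the `confident_beliefs` helper, and the 0.8 threshold are all hypothetical choices for illustration, not part of any existing language-agent implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Belief:
    """A natural-language belief paired with a credence in [0, 1]."""
    text: str
    credence: float


def confident_beliefs(memory: List[Belief], threshold: float = 0.8) -> List[str]:
    """Return the text of beliefs whose credence meets the threshold."""
    return [belief.text for belief in memory if belief.credence >= threshold]


memory = [
    Belief("The door is locked", 0.95),
    Belief("It will rain tomorrow", 0.4),
]
retrieved = confident_beliefs(memory)
```

Here the beliefs remain readable text, and the credence is an extra annotation that retrieval or planning could take into account.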
Thanks for taking the time to work through this carefully! I’m looking forward to reading and engaging with the articles you’ve linked to. I’ll make sure to implement the specific description-improvement suggestions in the final draft.
I wish I had more to say about the effort metric! So far, the only concrete ideas I’ve come up with are (i) measure how much compute each action requires; or (ii) decompose each action into a series of basic actions, and measure the number of basic actions necessary to perform the action. But both ideas are sketchy.
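Idea (ii) could be sketched as follows, assuming a hand-written decomposition table. The action names and the `effort` function are hypothetical, and the hard part (deciding what counts as a basic action) is simply assumed away here.

```python
# Hypothetical decomposition table: each composite action maps to its parts.
# Any action not listed as a key is treated as a basic action.
DECOMPOSITION = {
    "make_tea": ["boil_water", "steep_tea"],
    "boil_water": ["fill_kettle", "switch_on_kettle"],
}


def effort(action, decomposition=DECOMPOSITION):
    """Count the basic actions needed to perform `action`."""
    parts = decomposition.get(action)
    if parts is None:
        return 1  # basic action: one unit of effort
    return sum(effort(part, decomposition) for part in parts)


tea_effort = effort("make_tea")  # fill_kettle + switch_on_kettle + steep_tea
```

This assumes the decomposition table has no cycles. Idea (i) would replace the unit cost of a basic action with a measured compute cost.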
Thanks for reading!
Yes, you can think of it as having a non-corrigible complicated utility function. The relevant utility function is the ‘aggregated utilities’ defined in section 2. I think ‘corrigible’ vs ‘non-corrigible’ is slightly verbal, since it depends on how you define ‘utility’, but the non-verbal question is whether the resulting AI is safer.
Good idea, this is on my agenda!
Looking forward to reading up on geometric rationality in detail. On a quick first pass, looks like geometric rationality is a bit different because it involves deviating from axioms of VNM rationality by using random sampling. By contrast, utility aggregation is consistent with VNM rationality, because it just replaces the ordinary utility function with aggregated utility
Yep that’s right! One complication is maybe the agent could behave this way even though it wasn’t designed to.
The issue of unified AI parties is discussed but not resolved in section 2.2. There, I discuss some of the paths AIs may take to begin engaging in collective decision making. In addition, I flag that the key assumption is that one AI or multiple AIs acting collectively accumulate enough power to engage in strategic competition with human states.