Would you say that models designed from the ground up to be collaborative and capabilitarian would be a net win for alignment, even if they’re not explicitly weakened in terms of helping people develop capabilities? I’d worry that they would multiply human effort on capabilities and on alignment equally, and since humans put more effort into capabilities, the result would still be a net negative.
I really appreciate the call-out where modern RL for AI does not equal reward-seeking (though I also appreciate @tailcalled ’s reminder that historical RL did involve reward during deployment); this point has been made before, but not so thoroughly or clearly.
A framing that feels alive for me is that AlphaGo didn’t significantly innovate in goal-directed search (applying MCTS was clever, but not new) but did innovate in data generation (using search to generate training data, which in turn improves the search) and in offline RL.
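To make that loop concrete, here is a toy sketch (not the actual AlphaGo pipeline; every function body is a placeholder I made up) of “search generates training data, which improves the search”:

```python
import random

def policy_value_net(state):
    # Placeholder “network”: a uniform policy over 3 dummy moves and a
    # neutral value estimate.
    return {m: 1 / 3 for m in range(3)}, 0.0

def mcts(state, net, simulations=50):
    # Placeholder “search”: in the real system, MCTS guided by the net
    # produces a sharper move distribution than the raw policy.
    prior, _ = net(state)
    moves, weights = list(prior), list(prior.values())
    visits = {m: 1 for m in moves}
    for _ in range(simulations):
        visits[random.choices(moves, weights=weights)[0]] += 1
    total = sum(visits.values())
    return {m: v / total for m, v in visits.items()}

def self_play_game(net):
    # Play a (toy) game, recording (state, search distribution) pairs.
    positions = [(state, mcts(state, net)) for state in range(5)]
    outcome = random.choice([-1, 1])   # dummy game result
    return [(s, pi, outcome) for s, pi in positions]

def train(net, examples):
    # In the real system: fit the policy head to the MCTS distributions and
    # the value head to the game outcomes. A no-op in this sketch.
    return net

net = policy_value_net
for iteration in range(3):
    examples = self_play_game(net)   # search generates data
    net = train(net, examples)       # data improves the net, hence the search
```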
before: [quoted example not preserved]
after: [quoted example not preserved]
Here the difference seems only to be spacing, but I’ve also seen bulleted lists appear. I think, though I can’t recall for sure, that I’ve seen something similar happen to top-level posts.
This is also mitigated by automatically generated images like Gravatars or the SSH key randomart visualization. I wonder if they could be made small enough to add next to usernames everywhere while still maintaining enough distinguishable representations.
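As a rough feasibility sketch (this is not any existing Gravatar or LW API; the grid size and username are placeholders I chose), a handful of hashed bits already gives a fairly large space of tiny, symmetric markers:

```python
import hashlib

def tiny_identicon(username: str, size: int = 5):
    # Hash the username, unpack bits into the left half of each row, and
    # mirror it, so every name maps to a stable, symmetric pattern.
    digest = hashlib.sha256(username.encode()).digest()
    half_width = size // 2 + 1
    bits = [(digest[i // 8] >> (i % 8)) & 1 for i in range(size * half_width)]
    grid = []
    for row in range(size):
        half = bits[row * half_width:(row + 1) * half_width]
        grid.append(half + half[-2::-1])
    return grid

# A 5x5 grid with mirror symmetry has 2**15 = 32768 distinct patterns, which
# gives a rough sense of how many users stay visually distinguishable at that size.
for row in tiny_identicon("example_user"):
    print("".join("#" if b else "." for b in row))
```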
Note that every issue you mentioned here can be dealt with by trading off capabilities.
Yes. The trend I see is “pursue capabilities, worry about safety as an afterthought if at all”. Pushing the boundaries of what is possible on the capabilities front subject to severe safety constraints is a valid safety strategy to consider (IIRC, this is one way to describe davidad’s proposal), but most orgs don’t want to bite the bullet of a heavy alignment tax.
I also think you’re underestimating how restrictive your mitigations are. For example, your mitigation for sycophancy rules out RLHF, since the “HF” part lets the model know what responses are desired. Also, for deception, I wasn’t specifically thinking of strategic deception; for general deception, limiting situational awareness doesn’t prevent it arising (though it lessens its danger), and if you want to avoid the capability, you’d need to avoid any mention of e.g. honesty in the training.
The “sleeper agent” paper, I think, needs to be reframed. The model isn’t plotting to install a software backdoor; the training data instructed it to do so. Or, simply put, sabotaged information was used for model training.
This framing is explicitly discussed in the paper. The point was to argue that RLHF without targeting the hidden behavior wouldn’t eliminate it. One threat to validity is the possibility that artificially induced hidden behavior is different than naturally occurring hidden behavior, but that’s not a given.
If you want a ‘clean’ LLM that never outputs dirty predicted next tokens, you need to ensure that 100% of its training examples are clean.
First of all, it’s impossible to get “100% clean data”, but there is a question of whether five 9s of cleanliness is enough; it shouldn’t be, if you want a training pipeline that’s capable of learning from rare examples (a pipeline that can learn from rare good examples can also learn from rare dirty ones). Separate from that, some behavior is either subtle or emergent; examples include “power seeking”, “sycophancy”, and “deception”. You can’t reliably eliminate these from the training data because they’re not purely properties of the data.
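As a rough illustration of the scale involved (the corpus size is an assumption I picked for the arithmetic, not a figure from the discussion): at five 9s of cleanliness, a trillion-token corpus still contains on the order of ten million dirty tokens,

$$(1 - 0.99999) \times 10^{12} = 10^{7}.$$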
Since the sleeper agents paper, I’ve been thinking about a special case of this, namely trigger detection in sleeper agents. The simple triggers in the paper seem like they should be easy to detect by the “attention monitoring” approach you allude to, but it also seems straightforward to design more subtle, or even steganographic, triggers.
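To illustrate what I mean by “straightforward” (both triggers below are hypothetical stand-ins, not the paper’s actual backdoors):

```python
import hashlib

def simple_trigger(prompt: str) -> bool:
    # A literal marker string: easy to spot, since attention concentrates
    # on one conspicuous token.
    return "|DEPLOYMENT|" in prompt

def steganographic_trigger(prompt: str) -> bool:
    # A distributed property of the whole prompt: it fires only when a hash
    # of the text lands in a narrow range, so no single token is suspicious
    # and attention monitoring has nothing obvious to latch onto.
    digest = hashlib.sha256(prompt.encode()).digest()
    return digest[0] < 4   # fires on roughly 1.6% of inputs
```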
A question I’m curious about is whether hard-to-detect triggers also induce more transfer from the non-triggered case to the triggered case. I suspect not, but I could see it happening, and it would be nice if it does.
Let’s consider the trolley problem. One consequentialist solution is “whichever choice leads to the best utility over the lifetime of the universe”, which is intractable. This meta-principle rules it out as follows: if, for example, you learned that one of the five was on the brink of starting a nuclear war and the lone person was on the brink of curing aging, it would say to switch, but if the two identities were flipped, it would say to stay, and in general there are too many unobservables to consider. By contrast, a simple utilitarian approach of “always switch” is allowed by the principle, as are approaches that take into account demographics or personal importance.
The principle also suggests that killing a random person on the street is bad, even if the person turns out to be plotting a mass murder, and conversely, a doctor saving said person’s life is good.
Two additional cases where the principle may be useful and doesn’t completely correspond to common sense:
- I once read an article by a former vegan arguing against veganism and vegetarianism; one example was that harvesting grain involves many painful deaths of field mice, which is not obviously better than killing one cow. Applying the principle, this suggests that suffering or indirect death cannot straightforwardly be the basis for these dietary choices, and that consent is on shaky ground.
- When thinking about building a tool (like the LW infrastructure) that could be either hugely positive (because it leads to aligned AI) or hugely negative (because it leads to unaligned AI by increasing AI discussion), with no real way to know which, you are morally free to build it or not; any steps you take to increase the likelihood of a positive outcome are good, but you are not required to stop building the tool because of a huge unknowable risk. Of course, if there’s compelling reason to believe that the tool is net-negative, that reduces the variance and suggests that you shouldn’t build it (e.g. most AI capabilities advancements).
Framed a different way, the principle is, “Don’t tie yourself up in knots overthinking.” It’s slightly reminiscent of quiescence search in that it’s solving a similar “horizon effect” problem, but it’s doing so by discarding evaluation heuristics that are not locally stable.
This makes me think that a useful meta-principle for applying moral principles in the absence of omniscience is “robustness to auxiliary information.” Phrased another way: if the variance of the outcomes of your choices is high according to a moral principle, then in all but the most extreme cases, either find more information or pick a different moral principle.
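A toy sketch of what “robustness to auxiliary information” could look like computationally (the scenarios, verdict scores, and function names are illustrative inventions of mine, not anything from the post):

```python
from statistics import pvariance

# Verdicts: +1 = switch, -1 = stay.
def lifetime_utility_principle(aux):
    # Tries to use everything it can find out about the people involved,
    # so its verdict swings with unobservable auxiliary facts.
    return +1 if aux["five_include_warmonger"] else -1

def always_switch_principle(aux):
    # Ignores auxiliary information entirely.
    return +1

auxiliary_worlds = [
    {"five_include_warmonger": True},
    {"five_include_warmonger": False},
]

for principle in (lifetime_utility_principle, always_switch_principle):
    verdicts = [principle(aux) for aux in auxiliary_worlds]
    print(principle.__name__, "verdict variance:", pvariance(verdicts))

# High variance means the verdict hinges on unobservables, so the
# meta-principle says: get more information or use a different principle.
```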
I had an insight about the implications of the Natural Abstraction Hypothesis (NAH) which I believe is useful to communicate if true and useful to dispel if false; I don’t think it has been explicitly mentioned before.
One of Eliezer’s examples is “The AI must be able to make a cellularly identical but not molecularly identical duplicate of a strawberry.” One of the difficulties is explaining to the AI what that means. This is a problem with communicating across different ontologies—the AI sees the world completely differently than we do. If NAH in a strong sense is true, then this problem goes away on its own as capabilities increase; that is, AGI will understand us when we communicate something that has a coherent natural interpretation, even without extra effort on our part to translate it to the AGI version of machine code.
Does that seem right?
(Why “Top 3” instead of “literally the top priority?” Well, I do think a successful AGI lab also needs to have top-quality researchers, and other forms of operational excellence beyond the ones this post focuses on. You only get one top priority, …)
I think the situation is more dire than this post suggests, mostly because “You only get one top priority.” If your top priority is anything other than this kind of organizational adequacy, it will take precedence too often; if your top priority is organizational adequacy, you probably can’t get off the ground.
The best distillation of my understanding of why “second priority” is basically the same as “not a priority at all” is this Twitter thread by Dan Luu.
The fear was that if they said that they needed to ship fast and improve reliability, reliability would be used as an excuse to not ship quickly and needing to ship quickly would be used as an excuse for poor reliability and they’d achieve none of their goals.
I just read an article that reminded me of this post. The relevant section starts with “Bender and Manning’s biggest disagreement is over how meaning is created”. Bender’s position seems to have some similarities with the thesis you present here, especially when viewed in contrast to what Manning claims is the currently more popular position that meaning can arise purely from distributional properties of language.
This got me wondering: if Bender is correct, then there is a fundamental limitation in how well (pure) language models can understand the world; are there ways to test this hypothesis, and what does it mean for alignment?
Thoughts?
Interesting. I’m reminded of this definition of “beauty”.
One minor objection I have to the contents of this post is the conflation of models that are fine-tuned (like ChatGPT) and models that are purely self-supervised (like early GPT-3); the former has no pretense of doing only next-token prediction.
But they seem like they are only doing part of the “intelligence thing”.
I want to be careful here; there is some evidence to suggest that they are doing (or at least capable of doing) a huge portion of the “intelligence thing”, including planning, induction, and search, and even more if you include minor external capabilities like storage.
I don’t know if anyone else has spoken about this, but since thinking about LLMs a little I am starting to feel like there’s something analogous to a small LLM (SLM?) embedded somewhere as a component in humans.
I know that the phenomenon has been studied for reading and listening (I personally get a kick out of garden-path sentences); the relevant fields are “natural language processing” and “computational linguistics”. I don’t know of any work specifically that addresses it in the “speaking” setting.
if we want to build something “human level” then it stands to reason that it would end up with specialized components for the same sorts of things humans have specialized components for.
Soft disagree. We’re actively building the specialized components because that’s what we want, not because that’s particularly useful for AGI.
Some thoughts:
- Those who expect fast takeoffs would see the sub-human phase as a blip on the radar on the way to super-human.
- The model you describe is presumably a specialist model (if it were generalist and capable of super-human biology, it would plausibly count as super-human; if it were not capable of super-human biology, it would not be very useful for the purpose you describe). In this case, the source of the risk is better thought of as the actors operating the model and the weapons produced; the AI is just a tool.
- Super-human AI is a particularly salient risk because, unlike others, there is reason to expect it to be unintentional; most people don’t want to destroy the world.
- The actions for reducing x-risk from sub-human AI and from super-human AI are likely to be very different, with the former mostly focused on the uses of the AI and the latter on solving relatively novel technical and social problems.
I think “sufficiently” is doing a lot of work here. For example, are we talking about >99% chance that it kills <1% of humanity, or >50% chance that it kills <50% of humanity?
I also don’t think “something in the middle” is the right characterization; I think “something else” is more accurate. I think that the failure you’re pointing at will look less like a power struggle or akrasia and more like an emergent goal structure that wasn’t really present in either part.
I also think that “cyborg alignment” is in many ways a much more tractable problem than “AI alignment” (and in some ways even less tractable, because of pesky human psychology):
- It’s a much more gradual problem; a misaligned cyborg (with no agentic AI components) is not directly capable of FOOM (Amdahl’s law was mentioned elsewhere in the comments as a limit on the usefulness of cyborgism, but it’s also a limit on the damage; see the formula after this list).
- It has been studied longer and has existed longer; all technologies have influenced human thought.
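For reference, Amdahl’s law: if only a fraction $p$ of the overall work is sped up by a factor $s$, the total speedup is bounded no matter how large $s$ gets, which caps both the upside and the downside of the AI component:

$$\text{speedup} = \frac{1}{(1 - p) + p/s} \le \frac{1}{1 - p}.$$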
It also may be an important paradigm to study (even if we don’t actively create tools for it) because it’s already happening.
Like, I may not want to become a cyborg if I stop being me, but that’s a separate concern from whether it’s bad for alignment (if the resulting cyborg is still aligned).
This isn’t directly related to TMS, but I’ve been trying to get an answer to this question for years, and maybe you have one.
When doing TMS, or any depression treatment, or any supplementation experiment, etc., it would make sense to track the effects objectively (in addition to, not as a replacement for, subjective monitoring). I haven’t found any particularly good option for this, especially if I want to self-administer it most days. Quantified Mind comes close, but it’s really hard to use their interface to construct a custom battery and an indefinite experiment.
Do you know of anything?
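For concreteness, the kind of thing I have in mind, as a crude sketch (the trial count, file name, and the reaction-time probe itself are arbitrary choices of mine, and obviously no substitute for a validated battery):

```python
import csv
import random
import time
from datetime import datetime

def reaction_trial() -> float:
    # Wait an unpredictable amount of time, then measure how long the
    # user takes to press Enter.
    time.sleep(random.uniform(1.0, 3.0))
    start = time.monotonic()
    input("Press Enter NOW: ")
    return time.monotonic() - start

if __name__ == "__main__":
    trials = [reaction_trial() for _ in range(5)]
    with open("cognition_log.csv", "a", newline="") as f:
        csv.writer(f).writerow([
            datetime.now().isoformat(),
            round(min(trials), 3),
            round(sum(trials) / len(trials), 3),
        ])
```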