… whilst also creating incredible amounts of stress/trauma that would wash out any such effect.
(Fascinating theory though)
Could you please elaborate?
Invention is distinct from widespread usage.
I’m kind of confused about why these consequences didn’t hit home earlier.
Many people who were opposed to the war focused on less persuasive argumentative approaches, such as implying that anyone who supported it was a warmonger, calling Bush a war criminal or focusing purely on the number of deaths.
I suspect that people were more focused on expressing their frustration or signalling their tribal allegiance than on actually trying to change anyone’s mind, and that likely contributed to this delay.
We work on this mission because we believe that if we all collectively know more about AI, we will make better decisions on average.
Indeed, this is what I’d expect by default, but I wonder how much I should update from having seen not one, but three members of staff recently make an incredibly bad decision, especially since you’d expect an Epoch staff member to be far more informed than the average reader.
People used to believe that the Internet would usher in an Age of Enlightenment. Yeah, that didn’t happen.
Like, what looks like my motte and what looks like my bailey?
Between an overwhelmingly strong effect and a pretty strong effect.
There are precious few, if any, arguments of the form “But here’s a logical reason why we’re doomed!” that can distinguish between these two worlds.
Feels like that’s a Motte and Bailey?
Sure, if you believe that the claimed psychological effects are very, very strong, then that’s the case.
However, let’s suppose you don’t believe they’re quite that strong. Even if you believe that these effects are pretty strong, sufficiently strong arguments may still provide decent evidence.
In terms of why the evidence I provided is strong:
There’s a well-known phenomenon of “cranks”, where non-experts end up believing arguments that sound super persuasive to them, but which are obviously false to experts. The CAIS letter ruled that out.
It’s well-known that it’s easy to construct theoretical arguments that sound extremely persuasive, but bear no relation to how things work in practice. Having a degree of empirical evidence for some of the core claims greatly weakens this objection.
So it’s not just that these are strong arguments. It’s that these are arguments that you might expect to provide some signal even if you thought the claimed effect was strong, but not overwhelmingly so.
I don’t know man, it seems a lot less plausible now that we have the CAIS letter and inference-time compute has started producing a lot of concrete examples of AIs misbehaving[1].
So before the CAIS letter there were some good outside-view arguments, and in the ChatGPT 3.5-4 era alignment-by-default seemed quite defensible[2], but now we’re in an era where neither of these is the case, so your argument packs a lot less of a punch.
In retrospect, it isn’t plausible that IDA with some improvements yields a scalable alignment solution (especially while being competitive).
Why do you say this?
A major factor for me is the extent to which they expect the book to bring new life into the conversation about AI Safety. One problem with running a perfectly neutral forum is that people explore 1000 different directions at the cost of moving the conversation forward. There’s a lot of value in focusing people’s attention in the same direction so that progress can be made.
Lots of fascinating points, however:
a) You raise some interesting points about how the inner character is more underdefined than people often realise, but I think it’s also worth flagging that there’s less of a void these days, given that a lot more effort is being put into writing detailed model specs.
b) I am less dismissive of the risk of publicly talking about alignment research than I was before seeing Claude quote its own scenario. However, I think you’ve neglected the potential for us to apply filtering to the training data. Whilst I don’t think the solution will be that simple, I don’t think the relationship is quite as straightforward as you claim.
c) The discussion of “how do you think the LLMs feel about these experiments” is interesting, but it is also overly anthropomorphic. LLMs are anthropomorphic to a certain extent, having been trained on human data, but it is still mistaken to run a purely anthropomorphic analysis that doesn’t account for other training dynamics.
d) Whilst you make a good point about how the artificiality of the scenario might be affecting the experiment, I feel you’re being overly critical of some of the research into how models might misbehave. Single papers are rarely definitive, and often there’s value in simply showing a phenomenon exists in order to spur further research, which can then explore a wider range of theories about mechanisms. It’s very easy to say “oh, this is poor-quality research because it doesn’t address my favourite objection”. I’ve probably fallen into this trap myself. However, the number of possible objections is often pretty large, and if you never published until you’d addressed everything, you’d most likely never publish.
e) I worry that some of your skepticism of the risks manages to be persuasive by casting vague aspersions that are disconnected from the actual strength of the arguments. You’re like “oh, the future, the future, people are always saying it’ll happen in the future”, which probably sounds convincing to folks who haven’t been following that closely, but it’s a lot less persuasive if you know that we’ve been consistently seeing stronger results over time (in addition to a recent spike in anecdotes with the new reasoning models). This is just a natural part of the process: when you’re trying to figure out how to conduct solid research in a new domain, of course it’s going to take some time.
I agree that imitation learning seems underrated.
People think of imitation learning as weak, but they forget about the ability to amplify these models post-training (I discuss this briefly here).
I think it’s valuable for some people to say that it’s a terrible idea in advance so they have credibility after things go wrong.
I’m confused about what your bounty is asking exactly, but well done on posting this:
I’ll copy some arguments I wrote as part of an email thread I had with Michael Nielsen discussing his recent post on reconsidering alignment as a goal:
The core limitation of the standard differential tech development/d/acc/coceleration plans is that these kinds of imperfect defenses only buy time (this position can ironically be justified with the extremely persuasive arguments you’ve provided in your article). An aligned ASI, if it were possible, would be capable of providing defenses with a degree of perfection beyond the capabilities of human institutions to provide. A key component of this is that we could tell it what we want and simply trust it to do the right thing. We would get a longer-term stable solution. Plans involving less powerful AIs or a more limited degree of alignment almost never provide this.
That said, I have proposed that humanity might want to pursue a wisdom explosion instead of an intelligence explosion: https://aiimpacts.org/some-preliminary-notes-on-the-promise-of-a-wisdom-explosion/. This would reduce the capability externalities associated with ASI and allow us to tap AI advice to steer towards a longer-term stable solution.
(The wise AI angle is more helpful for reducing externalities from reckless actors, but feels less helpful against malicious actors utilising highly asymmetric technology. It is a matter of debate to what extent malicious preferences are intrinsic vs. the result of mistaken beliefs. However, my suspicion is that there are people with these intrinsic preferences.)
And a response to the claim that alignment involves endorsing a “dictatorial singleton solution”:
Many people have a sense that aligned AI has to be part of the solution, even if they don’t know what that solution is. I’ll try to explain why this might be reasonable:
1) Whilst most folk think it’s unlikely that the world will unite to work on a global AGI project, it’s not clearly impossible, and perhaps you’d consider this less dictatorial?
2) There may be ways of surviving a multi-polar world. One possibility would be merging the AGIs of multiple factions into a single, unitary AI. Alternatively, diplomacy might work if there are only a few factions.
3) Someone might come up with a creative solution (mutually assured sabotage is a recent proposal in this vein).
4) Even if we can’t find a non-dictatorial solution ourselves, we might be able to defer to an aligned AI to find one for us.
5) There’s the possibility of creating a “minimal singleton” that prevents folk from creating catastrophic risks and otherwise leaves humanity alone.
6) There’s the possibility of telling an aligned AI to create a global democracy (laws could still vary by nation just as different states within a nation can have different laws). This would be very unilateralist, but arguably non-dictatorial.
7) Even if human judgment is too flawed to justify a pivotal act, pursuit of a singleton might be justified if an aligned AI told us that would be the only way to avoid catastrophe (obviously there are risks related to sycophancy here).
Ofc, many folk will find it hard to articulate this kind of broad intuition, and it wouldn’t surprise me if this caused them to end up saying things that are rather vague… Relatively few alignment folk explicitly endorse the singleton solution. They might not comprehensively rule it out, but very few folk want to go down that path if there’s any way of avoiding it.
… I agree that most of these are weak individually, but collectively I think they form a decent response.
You might also want to check out work from the Center on Long-Term Risk on fanaticism.
I suspect it’s a bit more nuanced than this. Factors include the size of the audience, how often you’re interjecting, the quality of the interjections, whether you’re pushing the presenter off track towards your own pet issue, whether you’re asking a clarifying question that other audience members found useful, how formal the conference is, and how the speaker likes to run their sessions.
But the “role-playing game” glasses that you were wearing would have (understandably) made such a statement look like “flavor text”.
Quite possibly, but hard to know.
Fascinating, but I’m slightly confused about the link here. Any chance you could make the connection more explicit?