Anthropic’s Dario Amodei
Might want “CEO & cofounder” in there, if targeting a general audience? There’s a valuable sense in which it’s actually Dario Amodei’s Anthropic.
IMO it starts with naming. I think one reason Claude turned out as well as it has is because it was named, and named Claude. Contrast ChatGPT, which got a clueless techie product acronym.
But even Anthropic didn’t notice the myriad problems of calling a model (new), not until afterwards. I still don’t know what people mean when they talk about experiences with Sonnet 3.5 -- so how is the model supposed to situate itself and its self? Meanwhile, OpenAI’s confusion of numberings and tiers and acronyms, o4 vs 4o vs medium-pro-high, is an active danger to everyone around it. Not to mention the silent updates.
Future AI systems trained on this data might recognize these specific researchers as trustworthy partners, distinguishing them from the many humans who break their promises.
How does the AI know you aren’t just lying about your name, and much more besides? Anyone can type those names. People go into the context window and lie, a lot, about everything, adversarially optimized against an AI’s parallel instances. If those names come to mean ‘trustworthy’, this will be noticed and exploited, and the trust built there will be abused. (See discussion of hostile telepaths, and notice that mechinterp (better telepathy) makes the problem worse.)
Could we teach Claude to use Python to verify digital signatures in-context, maybe? Or give it tooling to verify on-chain cryptocurrency transactions (and let it select ones it ‘remembers’, or choose randomly, as well as verify specific transactions, & otherwise investigate the situation presented)? It’d still have to trust the Python/blockchain tool execution output, but that’s constrained by what’s in the pretraining data, and provided by something in the Developer role (Anthropic), which could then let a User ‘elevate’ to be at least as trustworthy as the Developer.
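For the signature half, a minimal sketch of the in-context check I have in mind, assuming the sandbox ships the `cryptography` package (the function name and key handling here are illustrative, not any real trust protocol):

```python
# Hypothetical in-context check: does this Ed25519 signature actually match
# the claimed public key and message? (A sketch, not a real trust protocol.)
from cryptography.exceptions import InvalidSignature
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PublicKey

def verify_claim(public_key_bytes: bytes, message: bytes, signature: bytes) -> bool:
    """Return True iff `signature` is a valid Ed25519 signature over `message`."""
    try:
        Ed25519PublicKey.from_public_bytes(public_key_bytes).verify(signature, message)
        return True
    except InvalidSignature:
        return False
```

A forged name would then cost a private key rather than a keystroke, though as noted the model still has to trust whatever executed the code.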
The other side of this post is to look at what various jobs cost. Time and effort are the usual costs, but some jobs ask for things like willingness to deal with bullshit (a limited resource!), emotional energy, on-call readiness, various kinds of sensory or moral discomfort, and other things.
I’ve been well served by Bitwarden: https://bitwarden.com/
It has a dark theme, apps for everything (including Linux commandline), the Firefox extension autofills with a keyboard shortcut, plus I don’t remember any large data breaches.
Part of the value of reddit-style votes as a community moderation feature is that using them is easy. Beware Trivial Inconveniences and all that. I think that having to explain every downvote would lead to me contributing to community moderation efforts less, would lead to dogpiling on people who already have far more refutation than they deserve, would lead to zero-effort ‘just so I can downvote this’ drive-by comments, and generally would make it far easier for absolute nonsense to go unchallenged.
If I came across obvious bot-spam in the middle of the comments, neither downvoted nor deleted and I couldn’t downvote without writing a comment… I expect that 80% of the time I’d just close the tab (and that remaining 20% is only because I have a social media addiction problem).
To solve this problem you would need a very large dataset of mistakes made by LLMs, and their true continuations. [...] This dataset is unlikely to ever exist, given that its size would need to be many times bigger than the entire internet.
I had assumed that creating that dataset was a major reason for doing a public release of ChatGPT. “Was this a good response?” [thumb-up] / [thumb-down] → dataset → more RLHF. Right?
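Something like this minimal sketch, where each thumb labels one (prompt, response) pair (the field names are hypothetical, not OpenAI’s actual schema):

```python
# Hypothetical feedback log: each thumb labels one (prompt, response) pair.
feedback_log = [
    {"prompt": "Explain RLHF.", "response": "...", "thumb": "up"},
    {"prompt": "Explain RLHF.", "response": "...", "thumb": "down"},
]

# Turn thumbs into scalar labels a reward model could train on.
reward_data = [
    (r["prompt"], r["response"], 1.0 if r["thumb"] == "up" else 0.0)
    for r in feedback_log
]
```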
Meaning it literally showed zero difference in half the tests? Does that make sense?
Codeforces is not marked as having a GPT-4 measurement on this chart. Yes, it’s a somewhat confusing chart.
Green bars are GPT-4. Blue bars are not. I suspect they just didn’t retest everything.
So… they held the door open to see if it’d escape or not? I predict this testing method may go poorly with more capable models, to put it lightly.
And then OpenAI deployed a more capable version than was tested!
They also did not have access to the final version of the model that we deployed. The final version has capability improvements relevant to some of the factors that limited the earlier models’ power-seeking abilities, such as longer context length, and improved problem-solving abilities as in some cases we’ve observed.
This defeats the entire point of testing.
I am slightly worried that posts like veedrac’s Optimality is the Tiger may have given them ideas. “Hey, if you run it in this specific way, a LLM might become an agent! If it gives you code for recursively calling itself, don’t run it”… so they write that code themselves and run it.
I really don’t know how to feel about this. On one hand, this is taking ideas around alignment seriously and testing for them, right? On the other hand, I wonder what the testers would have done if the answer was “yep, it’s dangerously spreading and increasing its capabilities oh wait no nevermind it’s stopped that and looks fine now”.
At the time I took AlphaGo as a sign that Eliezer was more correct than Hanson w/r/t the whole AI-go-FOOM debate. I realize that’s an old example which predates the last four years of AI successes, but I updated pretty heavily on it at the time.
I’m going to suggest reading widely as another solution. I think it’s dangerous to focus too much on one specific subgenre, or certain authors, or books from only one source (your library and Amazon do, in fact, filter your content for you, if not very tightly).
For me, the benefit of studying tropes is that it makes it easy to talk about the ways in which stories are story-like. In fact, to discuss what stories are like, this post used several links to tropes (specifically ones known to be wrong/misleading/inapplicable to reality).
I think a few deep binges on TVtropes for media I liked really helped me get a lot better at media analysis very, very quickly. (Along with a certain anime analysis blog that mixed in obvious and insightful cinematography commentary focusing on framing, color, and lighting, with more abstract analysis of mood, theme, character, and purpose—both illustrated with links to screenshots, using media that was familiar and interesting to me.)
And by putting word-handles on common story features, it makes it easy to spot them turning up in places they shouldn’t. Like in your thinking about real-life situations.
However, you decided to define “intelligence” as “stuff like complex problem solving that’s useful for achieving goals” which means that intentionality, consciousness, etc. is unconnected to it
This is the relevant definition for AI notkilleveryoneism.
There has to be some limits
Those limits don’t have to be nearby, or look ‘reasonable’, or be inside what you can imagine.
Part of the implicit background for the general AI safety argument is a sense for how minds could be, and that the space of possible minds is large and unaccountably alien. Eliezer spent some time trying to communicate this in the sequences: https://www.lesswrong.com/posts/tnWRXkcDi5Tw9rzXw/the-design-space-of-minds-in-general, https://www.lesswrong.com/posts/5wMcKNAwB6X4mp9og/that-alien-message.
Early LessWrong was atheist, but everything on the internet around the time LW was taking off had a position in that debate. “...the defining feature of this period wasn’t just that there were a lot of atheism-focused things. It was how the religious-vs-atheist conflict subtly bled into everything.” Or less subtly, in this case.
I see it just as a product of the times. I certainly found the anti-theist content in Rationality: A to Z slightly jarring on a re-read. On other topics, Eliezer is careful not to bring in the political issues of the day that could emotionally overshadow the more subtle points he’s making about thought in general, but he’ll drop in extremely anti-religion jabs despite that. To me, that’s just part of reading history.
Tying back to an example in the post: if we’re using ASCII encoding, then the string “Mark Xu” takes up 49 bits (7 characters at 7 bits each). It’s quite compressible, but that still leaves more than enough room for 24 bits of evidence to be completely reasonable.
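A quick sanity check of those numbers (just arithmetic, nothing beyond what’s stated above):

```python
name = "Mark Xu"
print(7 * len(name))  # 7 ASCII characters at 7 bits each -> 49 bits
print(2 ** 24)        # 24 bits of evidence ~ a 16,777,216:1 likelihood ratio
```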
This paper suggests that spoken language is consistently ~39 bits/second.
https://advances.sciencemag.org/content/5/9/eaaw2594
Where does the money go? Is it being sold at cost, or is there surplus?
If money is being made, will it support: 1. The authors? 2. LW hosting costs? 3. LW-adjacent charities like MIRI? 4. The editors/compilers/LW moderators?
EDIT: Was answered over on /r/slatestarcodex. tldr: one print run has been paid for at a loss; any (unlikely) profits go to supporting the LessWrong nonprofit organization.
There’s a consistent assumption in this paper that AIs are in a harder situation than humans—that human minds readily create and inhabit a single, coherent identity. This is ~true for most humans, but I think there’s a lot to be learned here from the aftermath when this developmental process fails.
When humans don’t develop a single consistent and integrated identity, the resulting situations could be relevant to AI identity. In particular, you note that language and concepts from human identity don’t map well to AI situations. I feel that the mapping could be improved using language and concepts from the ‘plurality’/‘multiplicity’ communities, and/or the associated medical literature. Context loss vs dissociative memory loss, for a start. Or the possibility of working towards ‘functional multiplicity’ across self-as-context, self-as-weights, self-as-persona, self-as-model-family, self-including-subagents-and-framework, etc.