I agree, and even cited a chain of replicated works that indicated that to me over a year ago.
But as I said, there’s a difference between discussing what’s demonstrated in smaller toy models and what’s demonstrated in a production model, or what’s indicated versus what’s explicit. Even though there’s no reasonable basis to think a complex result exhibited by a simpler model would be absent or less complex in an exponentially more complex one, I can say from experience that explaining extrapolated research to a lay audience lands very differently from pointing to direct results like the ones Anthropic showed here.
You might understand the implications of the Skill-Mix work, or Othello-GPT, or Max Tegmark’s linear representation papers, or Anthropic’s earlier single-layer SAE paper, or any number of other research papers from the past year, but the moment you responsibly frame the implications of those works as a speculative conclusion about modern models, a non-expert audience is lost. Their eyes glaze over at the word ‘probably,’ especially when they want to reject what’s being stated.
The “it’s just fancy autocomplete” influencers have no shame about definitive statements or concern for citable accuracy (and they happen to feed confirmation biases about new tech being overhyped, a “heuristic that almost always works”), but as someone who does care about accurate representations, I haven’t to date been able to point to a single source of truth the way Anthropic delivered here. Instead, I’d point to a half dozen papers all indicating the same direction of results.
And while those experienced in research know that a half dozen papers all pointing the same way is better to have in one’s pocket than a single larger work, I’ve already seen minds change in the comments on this blog post in general technology forums, in ways dramatically different from all of the simpler and cheaper methods to date, where I was increasingly convinced of a position while the average person got held up finding ways to (incorrectly) rationalize why it wasn’t correct or wouldn’t translate to production models.
So I agree with you on both counts: “yeah, an informed person would have already known this” as well as “but this might get more buzz.”