Independent alignment researcher
I have signed no contracts or agreements whose existence I cannot mention.
Although I don’t think the first example is great; it seems more like a capability/observation-bandwidth issue.
I think you can have multiple failures at the same time. The reason I think this was also Goodhart is that the failure mode could have been averted if Sonnet had been told “collect wood WITHOUT BREAKING MY HOUSE” ahead of time.
If you put current language models in weird situations & give them a goal, I’d say they do do edge instantiation, even without the supposedly missing “creativity” ingredient. E.g. see Claude Sonnet in Minecraft repurposing someone’s house for wood after being asked to collect wood.
Edit: There are other instances of this too, where you can tell Claude to protect you in Minecraft, and it will constantly teleport to your position and build walls around you when monsters are around. It protects you, but also prevents any movement or fun you may have wanted to have.
I don’t understand why Remmelt going “off the deep end” should affect AI Safety Camp’s funding. That seems reasonable for speculative bets, but not when there’s a strong track record available.
It is; we’ve been limiting ourselves to readings from the Sequence Highlights. I’ll ask around to see if other organizers would like to broaden our horizons.
I mean, one of them’s math built bombs and computers & directly influenced pretty much every part of applied math today, and the other one’s math built math. Not saying he wasn’t smart, but there’s no question that bombs & computers are more flashy.
Fixed!
The paper you’re thinking of is probably The Developmental Landscape of In-Context Learning.
@abramdemski I think I’m the biggest agree-vote for Alexander (without me, Alexander would be at −2 agreement), and I do see this, because I follow both of you on my subscribe tab.
I basically endorse Alexander’s elaboration.
On the “prep for the model that is coming tomorrow, not the model of today” front, I will say that LLMs are not always going to be as dumb as they are today. Even if you can’t get them to understand or help with your work now, their rate of learning still makes them, in some sense, your most promising mentee. That means trying to get as much of your tacit knowledge into their training data as possible (if you want them to be able to build on your work more easily and sooner), or (if you don’t want to do that for whatever reason) just generally not being caught flat-footed once they are smart enough to help you because all your ideas are locked up in videos or in high-context, understandable-only-to-Abram notes.
Should you write text online now in places that can be scraped? You are exposing yourself to ‘truesight’ and also to stylometric deanonymization or other analysis, and you may simply have some sort of moral objection to LLM training on your text.
This seems like a bad move to me on net: you are erasing yourself (facts, values, preferences, goals, identity) from the future, by which I mean, LLMs. Much of the value of writing done recently or now is simply to get stuff into LLMs. I would, in fact, pay money to ensure Gwern.net is in training corpuses, and I upload source code to Github, heavy with documentation, rationale, and examples, in order to make LLMs more customized to my use-cases. For the trifling cost of some writing, all the world’s LLM providers are competing to make their LLMs ever more like, and useful to, me.
In some sense that’s just like being hired for any other job, and of course if an AGI lab wants you, you end up with greater negotiating leverage at your old place and could get a raise (depending on how tight capital constraints are, which, to be clear, in AI alignment they are).
Over the past few days I’ve been doing a lit review of the different types of attention heads people have found and/or the metrics one can use to detect the presence of those types of heads.
Here is a rough list from my notes. Sorry for the poor formatting, but I did say it’s rough! (A toy sketch of how a couple of these scores get computed follows the list.)
- Bigram entropy
- Positional embedding ablation
- Prev-token attention
- Prefix-token attention
- ICL score
- Comp scores
- Multigram analysis
- Duplicate-token score
- Induction head score
- Long- vs short-prefix induction head differentiation
- Induction head specializations:
  - Literal copying head
  - Translation
  - Pattern matching
- Copying score
- (I don’t entirely trust this paper) Letter mover heads
- (possibly too specific to be useful) Year identification heads
  - also MLPs which id which years are greater than the selected year
- (I don’t entirely trust this paper) Queried rule locating head
- (I don’t entirely trust this paper) Queried rule mover head
- (I don’t entirely trust this paper) “Fact processing” head
- (I don’t entirely trust this paper) “Decision” head
- (possibly too specific) Subject heads
- (possibly too specific) Relation heads
- (possibly too specific) Mixed subject and relation heads
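To make a couple of these concrete, here is a minimal toy sketch (my own code, not taken from any of the papers above) of how per-head diagnostics like prev-token attention, duplicate-token score, and induction score are commonly computed from a head’s attention pattern. The array names (`attn`, `tokens`) and function names are my own; assume `attn` is a single head’s [seq, seq] causal attention pattern and `tokens` the corresponding token ids.

```python
# Toy sketch of a few per-head attention scores. Assumptions (mine, not the
# papers'): `attn` is a [seq, seq] attention pattern for one head
# (rows = queries, cols = keys, lower-triangular for a causal model),
# and `tokens` is the corresponding [seq] array of token ids.
import numpy as np

def prev_token_score(attn: np.ndarray) -> float:
    """Average attention each position pays to the immediately preceding position."""
    seq = attn.shape[0]
    return float(np.mean([attn[i, i - 1] for i in range(1, seq)]))

def duplicate_token_score(attn: np.ndarray, tokens: np.ndarray) -> float:
    """Average attention each position pays to earlier copies of its own token."""
    seq = attn.shape[0]
    scores = []
    for i in range(seq):
        dup = [j for j in range(i) if tokens[j] == tokens[i]]
        if dup:
            scores.append(attn[i, dup].sum())
    return float(np.mean(scores)) if scores else 0.0

def induction_score(attn: np.ndarray, rep_len: int) -> float:
    """On a random block of length `rep_len` repeated twice, average attention
    from each second-half position to the token *after* its previous occurrence
    (the standard induction-head diagnostic)."""
    seq = attn.shape[0]
    assert seq == 2 * rep_len
    return float(np.mean([attn[i, i - rep_len + 1] for i in range(rep_len, seq)]))

# Usage on a fake uniform causal attention pattern, just to show the shapes:
rng = np.random.default_rng(0)
block = rng.integers(0, 1000, size=32)
tokens = np.concatenate([block, block])      # repeated random sequence
attn = np.tril(np.ones((64, 64)))
attn /= attn.sum(axis=-1, keepdims=True)     # uniform attention over the past
print(prev_token_score(attn), duplicate_token_score(attn, tokens), induction_score(attn, 32))
```

In practice you would pull `attn` out of a real model run over many sequences and average, but the score definitions themselves are this simple.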
And yes, I do think that interp work today should mostly focus on image nets, for the same reasons we focus on image nets. The field’s current focus on LLMs is a mistake.
A note that the word on the street in mech-interp land is that you often get more signal, and a greater number of techniques work, on bigger & smarter language models than on smaller & dumber possibly-not-language models, presumably because smarter & more complex models have more structured representations.
Can you show how a repeated version of this game results in overall better deals for the company? I agree this can happen, but I don’t think it does in this particular circumstance.
Then the company is just being stupid, and the previous definition of exploitation doesn’t apply: the company is imposing large costs at a large cost to itself. If the company does refuse the deal, it’s likely because it doesn’t have the right kinds of internal communication channels to do negotiations like this, and so this is indeed a kind of stupidity.
Why the distinction between exploitation and stupidity? Well, they require different solutions. Maybe we solve exploitation (if indeed it is a problem) via collective action outside the company, but we would have to solve stupidity via better information channels & flexibility inside the company. There is also competitive pressure to solve such stupidity problems where there may not be for an exploitation problem. E.g. if a different company or a different department allowed that sort of deal, then the problem would be solved.
If conversations are heavy-tailed, then we should in fact expect people to have singular & likely memorable high-value conversations.
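As a toy illustration of that point (my own sketch, not anything from the original discussion): draw “conversation values” from a heavy-tailed distribution versus a thin-tailed one and compare how much of the total value the single best conversation accounts for.

```python
# Toy comparison: heavy-tailed (Pareto) vs thin-tailed (normal) conversation values.
# Distribution choices and parameters are illustrative assumptions, not data.
import numpy as np

rng = np.random.default_rng(0)
n = 1000  # conversations per person

pareto = rng.pareto(a=1.5, size=n) + 1.0                  # heavy-tailed values
normal = np.abs(rng.normal(loc=1.0, scale=0.3, size=n))   # thin-tailed values

for name, values in [("heavy-tailed", pareto), ("thin-tailed", normal)]:
    top_share = values.max() / values.sum()
    print(f"{name}: best conversation = {top_share:.1%} of total value")

# With heavy tails the single best conversation typically accounts for a much
# larger share of total value than in the thin-tailed case, i.e. a singular,
# memorable, high-value conversation.
```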
otoh I also don’t think cutting off contact with anyone “impure”, or refusing to read stuff you disapprove of, is either practical or necessary. we can engage with people and things without being mechanically “nudged” by them.
I think the reason not to do this is peer pressure. Ideally you should have the bad pressures from your peers cancel out, and to accomplish this you need your peers to be somewhat decorrelated from each other, which you can’t really do if all your peers and everyone you listen to are in the same social group.
there is no neurotype or culture that is immune to peer pressure
Seems like the sort of thing that would correlate pretty robustly with Big Five agreeableness, and in that sense there are neurotypes immune to peer pressure.
Edit: One may also suspect a combination of agreeableness and non-openness.
Engineers: It’s impossible.
Meta management:
~~Tony Stark~~ DeepSeek was able to build this in a cave! With a box of scraps!