Independent alignment researcher
I have signed no contracts or agreements whose existence I cannot mention.
Although I don’t think the first example is great; it seems more like a capability/observation-bandwidth issue.
I think you can have multiple failures at the same time. The reason I think this was also Goodhart is that the failure mode could have been averted if Sonnet had been told “collect wood WITHOUT BREAKING MY HOUSE” ahead of time.
If you put current language models in weird situations & give them a goal, I’d say they do do edge instantiation, even without the supposedly missing “creativity” ingredient. E.g. see Claude Sonnet in Minecraft repurposing someone’s house for wood after being asked to collect wood.
Edit: There are other instances of this too, where you can tell Claude to protect you in Minecraft, and it will constantly teleport to your position and build walls around you when monsters are around. It protects you, but also prevents any movement or fun you may have wanted to have.
I don’t understand why Remmelt going “off the deep end” should affect AI Safety Camp’s funding. That seems reasonable for speculative bets, but not when there’s a strong track record available.
It is; we’ve been limiting ourselves to readings from the Sequence Highlights. I’ll ask around to see if other organizers would like to broaden our horizons.
I mean, one of them’s math built bombs and computers & directly influenced pretty much every part of applied math today, and the other one’s math built math. Not saying he wasn’t smart, but there’s no question that bombs & computers are more flashy.
Fixed!
The paper you’re thinking of is probably The Developmental Landscape of In-Context Learning.
@abramdemski I think I’m the biggest agree-vote for Alexander (without me, Alexander would be at −2 agreement), and I do see this, because I follow both of you on my subscribe tab.
I basically endorse Alexander’s elaboration.
On the “prep for the model that is coming tomorrow, not the model of today” front, I will say that LLMs are not always going to be as dumb as they are today. Even if you can’t get them to understand or help with your work now, their rate of learning still makes them, in some sense, your most promising mentee. That means trying to get as much of your tacit knowledge into their training data as possible (if you want them to be able to build on your work more easily and sooner), or (if you don’t want to do that for whatever reason) just generally not being caught flat-footed once they are smart enough to help you because all your ideas are locked up in videos or in high-context, understandable-only-to-Abram notes.
Should you write text online now in places that can be scraped? You are exposing yourself to ‘truesight’ and also to stylometric deanonymization or other analysis, and you may simply have some sort of moral objection to LLM training on your text.
This seems like a bad move to me on net: you are erasing yourself (facts, values, preferences, goals, identity) from the future, by which I mean, LLMs. Much of the value of writing done recently or now is simply to get stuff into LLMs. I would, in fact, pay money to ensure Gwern.net is in training corpuses, and I upload source code to Github, heavy with documentation, rationale, and examples, in order to make LLMs more customized to my use-cases. For the trifling cost of some writing, all the world’s LLM providers are competing to make their LLMs ever more like, and useful to, me.
In some sense that’s just like being hired for any other job, and of course if an AGI lab wants you, you end up with greater negotiating leverage at your old place and could get a raise (depending on how tight capital constraints are, which, to be clear, in AI alignment they are).
Over the past few days I’ve been doing a lit review of the different types of attention heads people have found and/or the metrics one can use to detect the presence of those types of heads.
Here is a rough list from my notes. Sorry for the poor formatting, but I did say it’s rough! (A toy sketch of how a couple of these scores get computed follows the list.)
- Bigram entropy
- Positional embedding ablation
- Prev-token attention
- Prefix-token attention
- ICL score
- Comp scores
- Multigram analysis
- Duplicate-token score
- Induction head score
- Long- vs short-prefix induction head differentiation
- Induction head specializations:
  - Literal copying head
  - Translation
  - Pattern matching
- Copying score
- (I don’t entirely trust this paper) Letter mover heads
- (possibly too specific to be useful) Year identification heads
  - also MLPs which id which years are greater than the selected year
- (I don’t entirely trust this paper) Queried rule locating head
- (I don’t entirely trust this paper) Queried rule mover head
- (I don’t entirely trust this paper) “Fact processing” head
- (I don’t entirely trust this paper) “Decision” head
- (possibly too specific) Subject heads
- (possibly too specific) Relation heads
- (possibly too specific) Mixed subject and relation heads
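To make a couple of these concrete, here is a minimal toy sketch (my own code, not taken from any of the papers above) of how per-head diagnostics like prev-token attention, duplicate-token score, and induction score are commonly computed from a head’s attention pattern. The array names (`attn`, `tokens`) and function names are my own; assume `attn` is a single head’s [seq, seq] causal attention pattern and `tokens` the corresponding token ids.

```python
# Toy sketch of a few per-head attention scores. Assumptions (mine, not the
# papers'): `attn` is a [seq, seq] attention pattern for one head
# (rows = queries, cols = keys, lower-triangular for a causal model),
# and `tokens` is the corresponding [seq] array of token ids.
import numpy as np

def prev_token_score(attn: np.ndarray) -> float:
    """Average attention each position pays to the immediately preceding position."""
    seq = attn.shape[0]
    return float(np.mean([attn[i, i - 1] for i in range(1, seq)]))

def duplicate_token_score(attn: np.ndarray, tokens: np.ndarray) -> float:
    """Average attention each position pays to earlier copies of its own token."""
    seq = attn.shape[0]
    scores = []
    for i in range(seq):
        dup = [j for j in range(i) if tokens[j] == tokens[i]]
        if dup:
            scores.append(attn[i, dup].sum())
    return float(np.mean(scores)) if scores else 0.0

def induction_score(attn: np.ndarray, rep_len: int) -> float:
    """On a random block of length `rep_len` repeated twice, average attention
    from each second-half position to the token *after* its previous occurrence
    (the standard induction-head diagnostic)."""
    seq = attn.shape[0]
    assert seq == 2 * rep_len
    return float(np.mean([attn[i, i - rep_len + 1] for i in range(rep_len, seq)]))

# Usage on a fake uniform causal attention pattern, just to show the shapes:
rng = np.random.default_rng(0)
block = rng.integers(0, 1000, size=32)
tokens = np.concatenate([block, block])      # repeated random sequence
attn = np.tril(np.ones((64, 64)))
attn /= attn.sum(axis=-1, keepdims=True)     # uniform attention over the past
print(prev_token_score(attn), duplicate_token_score(attn, tokens), induction_score(attn, 32))
```

In practice you would pull `attn` out of a real model run over many sequences and average, but the score definitions themselves are this simple.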
And yes, I do think that interp work today should mostly focus on image nets, for the same reasons we focus on image nets. The field’s current focus on LLMs is a mistake.
A note that the word on the street in mech-interp land is that you often get more signal, and a greater number of techniques work, on bigger & smarter language models than on smaller & dumber possibly-not-language models, presumably because smarter & more complex models have more structured representations.
Can you show how a repeated version of this game results in overall better deals for the company? I agree this can happen, but I don’t think it does in this particular circumstance.
Then the company is just being stupid, and the previous definition of exploitation doesn’t apply: the company is imposing large costs at a large cost to itself. If the company does refuse the deal, it’s likely because it doesn’t have the right kinds of internal communication channels to do negotiations like this, and so this is indeed a kind of stupidity.
Why the distinction between exploitation and stupidity? Well, they require different solutions. Maybe we solve exploitation (if indeed it is a problem) via collective action outside the company, but we would have to solve stupidity via better information channels & flexibility inside the company. There is also competitive pressure to solve such stupidity problems where there may not be for an exploitation problem. E.g. if a different company or a different department allowed that sort of deal, then the problem would be solved.
If conversations are heavy-tailed, then we should in fact expect people to have singular & likely memorable high-value conversations.
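As a toy illustration of that point (my own sketch, not anything from the original discussion): draw “conversation values” from a heavy-tailed distribution versus a thin-tailed one and compare how much of the total value the single best conversation accounts for.

```python
# Toy comparison: heavy-tailed (Pareto) vs thin-tailed (normal) conversation values.
# Distribution choices and parameters are illustrative assumptions, not data.
import numpy as np

rng = np.random.default_rng(0)
n = 1000  # conversations per person

pareto = rng.pareto(a=1.5, size=n) + 1.0                  # heavy-tailed values
normal = np.abs(rng.normal(loc=1.0, scale=0.3, size=n))   # thin-tailed values

for name, values in [("heavy-tailed", pareto), ("thin-tailed", normal)]:
    top_share = values.max() / values.sum()
    print(f"{name}: best conversation = {top_share:.1%} of total value")

# With heavy tails the single best conversation typically accounts for a much
# larger share of total value than in the thin-tailed case, i.e. a singular,
# memorable, high-value conversation.
```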
otoh I also don’t think cutting off contact with anyone “impure”, or refusing to read stuff you disapprove of, is either practical or necessary. we can engage with people and things without being mechanically “nudged” by them.
I think the reason not to do this is peer pressure. Ideally you should have the bad pressures from your peers cancel out, and to accomplish this you need your peers to be somewhat decorrelated from each other, which you can’t really do if all your peers and everyone you listen to are in the same social group.
there is no neurotype or culture that is immune to peer pressure
Seems like the sort of thing that would correlate pretty robustly with Big Five agreeableness, and in that sense there are neurotypes immune to peer pressure.
Edit: One may also suspect a combination of agreeableness and non-openness.
Engineers: It’s impossible.
Meta management:
~~Tony Stark~~ DeepSeek was able to build this in a cave! With a box of scraps!