I’m trying to prevent doom from AI. Currently trying to become sufficiently good at alignment research. Feel free to DM for meeting requests.
Towards_Keeperhood
Applications (here) start with a simple 300 word expression of interest and are open until April 15, 2025. We have plans to fund $40M in grants and have available funding for substantially more depending on application quality.
Did you consider instead committing to give out retroactive funding for research progress that turns out to be useful?
I.e., people could apply for funding for anything done from 2025 onward, and then you could actually evaluate how useful some research was, rather than needing to guess in advance how useful a project might be. Quite impactful results could be paid a lot, so you don't disincentivize low-chance-high-reward strategies. And we'd get impact-market dynamics, where investors can fund projects in exchange for a share of the retroactive funding in case of success.
There are difficulties of course. Intuitively this retroactive approach seems a bit more appealing to me, but I’m basically just asking whether you considered it and if so why you didn’t go with it.
Applications (here) start with a simple 300 word expression of interest and are open until April 15, 2025. We have plans to fund $40M in grants and have available funding for substantially more depending on application quality.
Side question: How much is Open Phil funding LTFF? (And why not more?)
(I recently got an email from LTFF which suggested that they are quite funding-constrained. And I'd intuitively expect LTFF to be higher impact per dollar than this program, though I don't really know.)
I created an Obsidian Templater template for the 5-minute version of this skill. It inserts the following list:
how could I have thought that faster?
recall—what are key takeaways/insights?
trace—what substeps did I do?
review—how could one have done it (much) faster?
what parts were good?
where did i have wasted motions? what mistakes did i make?
generalize lesson—how act in future?
what are example cases where this might be relevant?
Here's the full template, so that it inserts this at the right level of indentation. (You can set a shortcut for inserting this template; I use "Alt+h".)
<% "\t".repeat(tp.file.content.split("\n")[app.workspace.activeLeaf?.view?.editor.getCursor().line].match(/^\t*/)[0].length) + "- how could I have thought that faster?" %>
<% "\t".repeat(tp.file.content.split("\n")[app.workspace.activeLeaf?.view?.editor.getCursor().line].match(/^\t*/)[0].length + 1) + "- recall—what are key takeaways/insights?" %>
<% "\t".repeat(tp.file.content.split("\n")[app.workspace.activeLeaf?.view?.editor.getCursor().line].match(/^\t*/)[0].length + 2) + "- " %>
<% "\t".repeat(tp.file.content.split("\n")[app.workspace.activeLeaf?.view?.editor.getCursor().line].match(/^\t*/)[0].length + 1) + "- trace—what substeps did I do?" %>
<% "\t".repeat(tp.file.content.split("\n")[app.workspace.activeLeaf?.view?.editor.getCursor().line].match(/^\t*/)[0].length + 2) + "- " %>
<% "\t".repeat(tp.file.content.split("\n")[app.workspace.activeLeaf?.view?.editor.getCursor().line].match(/^\t*/)[0].length + 1) + "- review—how could one have done it (much) faster?" %>
<% "\t".repeat(tp.file.content.split("\n")[app.workspace.activeLeaf?.view?.editor.getCursor().line].match(/^\t*/)[0].length + 2) + "- " %>
<% "\t".repeat(tp.file.content.split("\n")[app.workspace.activeLeaf?.view?.editor.getCursor().line].match(/^\t*/)[0].length + 2) + "- what parts were good?" %>
<% "\t".repeat(tp.file.content.split("\n")[app.workspace.activeLeaf?.view?.editor.getCursor().line].match(/^\t*/)[0].length + 2) + "- where did i have wasted motions? what mistakes did i make?" %>
<% "\t".repeat(tp.file.content.split("\n")[app.workspace.activeLeaf?.view?.editor.getCursor().line].match(/^\t*/)[0].length + 1) + "- generalize lesson—how act in future?" %>
<% "\t".repeat(tp.file.content.split("\n")[app.workspace.activeLeaf?.view?.editor.getCursor().line].match(/^\t*/)[0].length + 2) + "- " %>
<% "\t".repeat(tp.file.content.split("\n")[app.workspace.activeLeaf?.view?.editor.getCursor().line].match(/^\t*/)[0].length + 2) + "- what are example cases where this might be relevant?" %>
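In case it's useful: the same template can probably be condensed into a single Templater execution block, which keeps all the checklist items in one place. This is an untested sketch, assuming Templater's <%* %> execution syntax where tR is the output string; the indent depth is computed once from the line the cursor is on.
<%*
// Untested sketch: compute the current line's tab depth once, then emit the
// whole checklist with indents relative to it. tR is Templater's output string.
const line = app.workspace.activeLeaf?.view?.editor.getCursor().line;
const depth = tp.file.content.split("\n")[line].match(/^\t*/)[0].length;
const items = [
  [0, "- how could I have thought that faster?"],
  [1, "- recall—what are key takeaways/insights?"],
  [2, "- "],
  [1, "- trace—what substeps did I do?"],
  [2, "- "],
  [1, "- review—how could one have done it (much) faster?"],
  [2, "- "],
  [2, "- what parts were good?"],
  [2, "- where did i have wasted motions? what mistakes did i make?"],
  [1, "- generalize lesson—how act in future?"],
  [2, "- "],
  [2, "- what are example cases where this might be relevant?"],
];
// Build each bullet line at its relative indent and append to the output.
tR += items.map(([extra, text]) => "\t".repeat(depth + extra) + text).join("\n");
%>
The output should be the same nested list as above; editing the checklist later only requires touching the items array.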
For each lesson, I now want to always think of concrete examples where it might become relevant in the next week/month, instead of just reading it.
As of a couple of days ago, I have a file where I save lessons from such review exercises, for reviewing them periodically.
Some are in a weekly-review category and some in a monthly-review one. Every day when I do my daily recall, I now also check through the lessons under the corresponding weekday and day-of-month tag.
Here's how my file currently looks:
(I use some short codes for typing faster like "W=what", "h=how", "t=to", "w=with" and maybe some more.)
- Mon
	- [[lesson—clarify Gs on concrete examples]]
	- [[lesson—delegate whenever you can (including if possible large scale responsibilities where you need to find someone competent and get funding)]]
	- [[lesson—notice when i search for facts (e.g. w GPT) (as opposed to searching for understanding) and then perhaps delegate if possible]]
- Tue
	- [[lesson—do not waste time on designing details that i might want to change later]]
	- [[periodic reminder—stop and review what you'd do if you had pretty unlimited funding → if it could speed you up, then perhaps try to find some]]
- Wed
	- [[lesson—try to find edge cases where your current model does not work well]]
	- notice when sth worked well (you made good progress) → see h you did that (-> generalize W t do right next time)
- Thu
	- it's probably useless/counterproductive to apply effort for thinking. rather try to calmly focus your attention.
	- perhaps train to energize the thing you want to think about like a swing through resonance. (?)
- Fri
	- [[lesson—first ask W you want t use a proposal for rather than directly h you want proposal t look like]]
- Sat
	- [[lesson—start w simple plan and try and rv and replan, rather than overoptimize t get great plan directly]]
- Sun
	- group
		- plan for particular (S)G h t achieve it rather than find good general methodology for a large class of Gs
		- [[lesson—when possible t get concrete example (or observations) then get them first before forming models or plans on vague ideas of h it might look like]]
- 1
	- don't dive too deep into math if you don't want to get really good understanding (-> either get shallow or very deep model, not half-deep)
- 2
	- [[lesson—take care not to get sidetracked by math]]
- 3
	- [[lesson—when writing an important message or making a presentation, imagine what the other person will likely think]]
- 4
	- [[lesson—read (problem statements) precisely]]
- 5
	- perhaps more often ask myself "Y do i blv W i blv?" (e.g. after rc W i think are good insights/plans)
- 6
	- sometimes imagine W keepers would want you to do
- 7
	- group
		- beware conceptual limitations you set yourself
		- sometimes imagine you were smarter
- 8
	- possible tht patts t add
		- if PG not clear → CPG
		- if G not clear → CG
		- if not sure h continue → P
		- if say sth abstract → TBW
		- if say sth general → E (example)
- 9
	- ,rc methodology i want t use (and Y)
		- Keltham methodology.
			- loop: pr → gather obs → carve into subprs → attack a subpr
- 10
	- reminder of insights:
		- hyp that any model i have needs t be able t be applied on examples (?)
		- disentangle habitual execution from model building (??)
		- don't think too abstractly. see underlying structure to be able t carve reality better. don't be blinded by words. TBW.
		- don't ask e.g. W concepts are, but just look at observations and carve useful concepts anew.
		- form models of concrete cases and generalize later.
- 11
	- always do introspection/rationality-training and review practices. (except maybe in some sprints.)
- 12
	- Wr down questions towards the end of a session. Wr down questions after having formed some takeaway. (from Abram)
- 13
	- write out insights more in math (from Abram)
- 14
	- periodically write out my big picture of my research (from Abram)
- 15
	- Hoops. first clarify observations. note confusions. understand the problem.
- 16
	- have multiple hypotheses. including for plans as hypotheses of what's the best course of action.
- 17
	- actually fucking backchain. W are your LT Gs.
- 18
- 19
- 20
- 21
- 22
- 23
- 24
- 25
- 26
- 27
- 28
- 29
- 30
	- read https://www.lesswrong.com/posts/f2NX4mNbB4esdinRs/towards_keeperhood-s-shortform?commentId=D66XSCkv6Sxwwyeep
Belief propagation seems too much of a core of AI capability to me. I’d rather place my hope on GPT7 not being all that good yet at accelerating AI research and us having significantly more time.
This just seems doomed to me. The training runs will be even more expensive, the difficulty of doing anything significant as an outsider ever-higher. If the eventual plan is to get big labs to listen to your research, then isn’t it better to start early? (If you have anything significant to say, of course.)
I'd imagine it's not too hard to get a >1 OOM efficiency improvement which one can demonstrate in smaller AIs, and one might use this to get a lab to listen. If the labs are sufficiently uninterested in alignment, it's pretty doomy anyway, even if they adopted a better paradigm.
Also, government interventions might still happen (perhaps more likely because of AI-caused unemployment than because of x-risk, and they won't buy amazingly much time, but still).
Also, the strategy of "maybe if AIs are more rational they will solve alignment, or at least realize that they cannot" seems very unlikely to me to work in the current DL paradigm, though it's still slightly helpful.
(Also maybe some supergenius or my future self or some other group can figure something out.)
I don’t think that. See the bottom part of the comment you’re replying to. (The part after “Here’s what I would say instead:”)
Sorry, my comment was sloppy.
Right, my point is, I don’t see any difference between “AIs that produce slop” and “weak AIs” (a.k.a. “dumb AIs”).
(I agree that the way I used "sloppy" in my comment mostly meant "weak". But some other thoughts:)
So I think there are some dimensions of intelligence which are more important for solving alignment than for creating ASI. In planecrash terms, WIS and rationality training seem to me more important in that way than INT.
I don't really have much hope for DL-like systems solving alignment, but a similar case might be if an early transformative AI recognizes its limits and says "no, I cannot solve the alignment problem; the way my intelligence is shaped is not well suited to avoiding value drift; we should stop scaling and take more time, where I work with very smart people like Eliezer etc. for some years to solve alignment". And depending on the intelligence profile of the AI, it might be more or less likely that this will happen (currently it seems quite unlikely).
But overall those "better" intelligence dimensions still seem to me too central for AI capabilities, so I wouldn't publish stuff. (Btw, the way I read John's post was more like "fake alignment proposals are a main failure mode" rather than also "… and therefore we should work on making AIs more rational/sane/whatever". So given that, I maybe would defend John's framing, but I'm not sure.)
So the lab implements the non-solution, turns up the self-improvement dial, and by the time anybody realizes they haven’t actually solved the superintelligence alignment problem (if anybody even realizes at all), it’s already too late.
If the AI is producing slop, then why is there a self-improvement dial? Why wouldn’t its self-improvement ideas be things that sound good but don’t actually work, just as its safety ideas are?
Because it's much easier to speed up AI capabilities while being sloppy than to produce actually good alignment ideas.
If you really think you need to be similarly un-sloppy to build ASI as to align ASI, I'd be interested in discussing that. So maybe give some pointers to why you think that (or tell me to start).
Thanks for providing a concrete example!
Belief propagation seems too much of a core of AI capability to me. I’d rather place my hope on GPT7 not being all that good yet at accelerating AI research and us having significantly more time.
I also think the "drowned out in the noise" concern isn't that realistic: you ought to be able to show some quite impressive results relative to the computing power used. Though when you should try to convince the AI labs of your better paradigm is going to be difficult to call. It's plausible to me that we won't see signs that make us sufficiently confident that we only have a short time left, and it's plausible that we will.
In any case, before you publish something you can share it with trustworthy people, and then we can discuss that concrete case in detail.
Btw, to be clear, something that I think slightly speeds up AI capabilities but is still good to publish is e.g. producing rationality content for helping humans think more effectively (AIs might be able to adopt the techniques as well). Creating a language for rationalists to reason in more Bayesian ways would probably also be good to publish.
Can you link me to what you mean by John's model more precisely? If you mean John's slop-instead-scheming post, I agree with the "slop slightly more likely than scheming" part. I might need to reread John's post to see what the concrete suggestions for what to work on might be. Will do so tomorrow.
I’m just pessimistic that we can get any nontrivially useful alignment work out of AIs until a few months before the singularity, at least besides some math. Or like at least for the parts of the problem we are bottlenecked on.
So I think it's valuable to have AIs that are near the singularity be more rational. But I don't really buy the differentially-improving-alignment thing. Or could you give a somewhat concrete example of what you think might be good to publish?
Like, all capabilities will help somewhat with the AI being less likely to make errors that screw up its alignment. Which ones do you think are more important than others? There would have to be a significant difference in the usefulness of some capabilities, because otherwise you could just do the same alignment work later and still have similarly much time until superintelligence (and could get more non-timeline-accelerating work done).
Thanks.
True, I think your characterization of tiling agents is better. But my impression was sorta that this self-trust is an important precursor for the dynamic self-modification case, where alignment properties need to be preserved through the self-modification. Yeah, I guess calling this "the AI solving alignment" is sorta confused, though maybe there's something in this direction, because the AI still does the search to try to preserve the alignment properties?
Hm, I mean yeah, if the current bottleneck is math rather than conceptualizing what math has to be done, then it's a bit more plausible. I think it ought to be feasible to get AIs that are extremely good at proving theorems and maybe also at formalizing conjectures. Though I'd be a lot more pessimistic about them finding good formal representations for describing/modelling ideas.
Do you think we are basically only bottlenecked on math, so that sufficient math skill could carry us to aligned AI? Or do we only have some alignment-philosophy overhang that you want to formalize, after which more philosophy will be needed?
What kind of alignment research do you hope to speed up anyway?
For advanced-philosophy-like stuff (e.g. finding good formal representations for world models, or inventing logical induction), they don't seem anywhere remotely close to being useful.
My guess would be that for tiling agents theory they aren't useful either, but I haven't worked on it, so I'm very curious about your take here. (IIUC, to some extent the goal of tiling-agents-theory-like work was to have an AI solve its own alignment problem. Not sure how far the theory side got there, and whether it could be combined with LLMs.)
Or what is your alignment hope in more concrete detail?
This argument might move some people to work on “capabilities” or to publish such work when they might not otherwise do so.
Above all, I’m interested in feedback on these ideas. The title has a question mark for a reason; this all feels conjectural to me.
My current guess:
I wouldn't expect much useful research to come from publishing the ideas. They're mostly just going to be used for capabilities, so it seems like a bad idea to publish.
Sure, you can work on it, be infosec-cautious, and keep it secret. Maybe share it with a few very trusted people who might actually have some good ideas. And depending on how things play out: if in a couple of years there's some actual effort from the joint collection of the leading labs to align AI, and they only have like 2-8 months left before competition hits the AI-improving-AI dynamic quite hard, then you might go to the labs and share your ideas with them (while still trying to keep them closed within those labs, which will probably only work for a few months or a year or so until there's leakage).
Due to the generosity of ARIA, we will be able to offer a refund proportional to attendance, with a full refund for completion. The cost of registration is $200, and we plan to refund $25 for each week attended, as well as the final $50 upon completion of the course. We’ll ask participants to pay the registration fee once the cohort is finalized, so no fee is required to fill out the application form below.
Wait, so do we get the refund if we decide we don't want to do the course, or if we manage to complete the course?
Like, is it a refund in the "get your money back if you don't like it" sense, or is it an incentive against signing up and then not completing the course?
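(If I'm reading the numbers right, a full refund on completion implies the course runs about six weeks: 6 × $25 + $50 = $200.)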
Nice post!
My key takeaway: “A system is aligned to human values if it tends to generate optimized-looking stuff which is aligned to human values.”
I think this is useful progress. In particular it’s good to try to aim for the AI to produce some particular result in the world, rather than trying to make the AI have some goal—it grounds you in the thing you actually care about in the end.
I'd say the "… aligned to human values" part is still underspecified (and I think you at least partially agree):
"aligned": what does the ontology translation between the representation of the "generated optimized-looking stuff" and the representation of human values look like?
“human values”
I think your model of humans is too simplistic. E.g. at the very least it's lacking a distinction like the one between "ego-syntonic" and "voluntary" as in this post, though I'd probably want an even significantly more detailed model. Also, one might need different models for very smart and reflective people than for most people.
We haven’t described value extrapolation.
(Or, from an alternative perspective: our model of humans doesn't identify their relevant metapreferences (which probably no human knows fully explicitly, and which for some/many humans might not be really well defined).)
Positive reinforcement for first trying to better understand the problem before running off and trying to solve it! I think that’s the way to make progress, and I’d encourage others to continue work on more precisely defining the problem, and in particular on getting better models of human cognition to identify how we might be able to rebind the “human values” concept to a better model of what’s happening in human minds.
Btw, I'd have put the corrigibility section into a separate post; it's not nearly up to the standards of the rest of this post.
To set expectations: this post will not discuss …
Maybe you want to add here that this is not meant to be an overview of alignment difficulties, or an explanation for why alignment is hard.
Agreed that people focus a bit too much on scheming. It might be good for some people to think a bit more about the other failure modes you described, but the main thing that needs doing is very smart people making progress towards building an aligned AI, not defending against particular failure modes. (However, most people probably cannot usefully contribute to that, so maybe focusing on failure modes is still good for most people. Only, in any case, there's the problem that people will find proposals that very likely don't actually work but that are easier to believe in, thereby making an AI stop a bit less likely.)
In general, I wish more people would make posts about books without feeling the need to do boring parts they are uninterested in (summarizing and reviewing) and more just discussing the ideas they found valuable. I think this would lower the friction for such posts, resulting in more of them. I often wind up finding such thoughts and comments about non-fiction works by LWers pretty valuable. I have more of these if people are interested.
I liked this post, thanks and positive reinforcement. In case you didn’t already post your other book notes, just letting you know I’d be interested.
Do we have a sense for how much of the orca brain is specialized for sonar?
I don’t know.
But evolution slides functions around on the cortical surface, and (Claude tells me) association areas like the prefrontal cortex are particularly prone to this.
It’s particularly bad for cetaceans. Their functional mapping looks completely different.
Thanks. Yep I agree with you, some elaboration:
(This comment assumes you at least read the basic summary of my project (or watched the intro video).)
I know of Earth Species Project (ESP) and CETI (though I only read 2 publications of ESP and none of CETI).
I don't expect them to succeed at something equivalent to decoding orca language to the extent that we could communicate with orcas almost as richly as they communicate among each other. (Though if long-range sperm whale signals are a lot simpler, they might be easier to decode.)
From what I've seen, they are mostly trying to throw AI at the data and hoping they will somehow understand something, without having a clear plan for how to actually decode it. The AI work might look advanced, but it's the sorta obvious thing to try, and I think it's unlikely to work very well, though I'm still glad they are trying it.
If you look at orca vocalizations, they look complex and alien. The patterns we can currently recognize there look very different from what we'd be able to see in an unknown human language. The embedding mapping might be useful if we had to decode a human language, and maybe we'll still learn some useful stuff from it, but for orca language we don't even know what their analog of words and sentences is, and maybe their language even works somewhat differently (though I'd guess that if they are smarter than humans there's probably going to be something like words and sentences, but they might be encoded differently in the signals than in human languages).
It's definitely plausible that AI can help significantly with decoding animal languages, but I think it also requires forming a deep understanding of some things, and I think it's likely too hard for ESP to succeed anytime soon. A supergenius could possibly do it in a few years, but it would be really impressive.
My approach may fail, especially if orcas aren't at least roughly human-level smart, but it has the advantage that we can show orcas precise context for what some words and sentences mean, whereas we have basically almost no context data for recordings of orca vocalizations. So it's easier for them to see what some of our signals mean than for us to infer what orca vocalizations mean. (Even if we had a lot of video datasets with vocalizations (which we don't), that's still a lot less context information about what they are talking about than if they could show us images to indicate what they want to talk about.) Of course humans have more research experience and better tools for decoding signals, but it doesn't look to me like anyone is currently remotely close, and my approach is much quicker to try and might have at least a decent chance. (I mean, it nonzero worked with bottlenose dolphins (in terms of grammar, better than with great apes), though I'd be a lot more ambitious.)
Of course, the language I create will also be alien to orcas, but I think if they are good enough at abstract pattern recognition they might still be able to learn it.
The meta-problem of consciousness is about explaining why people think they are conscious.
Even if we got such a result with AIs, where they invent a concept like consciousness from scratch, that would only tell us that they also think they have something we call consciousness, but not yet why they think this.
That is, unless we can somehow precisely inspect the cognitive processes that generated the consciousness concept in the AIs, which on anything like the current paradigm we won't be able to.
Another way to frame it: Why would it matter that an AI invents the concept of consciousness, rather than another human? Where is the difference that lets us learn more about the hard/meta problem of consciousness in the first place?
Separately, even if we could analyze the thought processes of AIs in such a case, and thereby solve the meta-problem of consciousness by seeing explanations of why AIs/people talk about consciousness the way they do, that doesn't mean you have already solved the meta-problem of consciousness now.
I.e., just because you know it's solvable doesn't mean you're done; you haven't solved it yet. It's like the difference between knowing that general relativity exists and understanding the theory and the math.