It seems like a significant amount of decision theory progress happened between 2006 and 2010, and since then progress has stalled.

- Counterfactual mugging was invented independently by Gary Drescher in 2006, and by Vladimir Nesov in 2009.
- Counterlogical mugging was invented by Vladimir Nesov in 2009.
- The “agent simulates predictor” problem (now popularly known as the commitment races problem) was invented by Gary Drescher in 2010.
- The “self-fulfilling spurious proofs” problem (now popularly known as the 5-and-10 problem) was invented by Benja Fallenstein in 2010.
- Updatelessness was first proposed by Wei Dai in 2009.
You are leaving out a ridiculous amount of context, but yes, if you are okay with leather footwear, Meermin provides great footwear at relatively inexpensive prices.
I still recommend thrift shopping instead. I spent 250 EUR on a pair of new boots from Meermin, and 50 EUR on a pair of thrifted boots which seem about 80% as aesthetically pleasing as the first pair (and just as comfortable, since I tried them on before buying them).
It has been six months since I wrote this, and I want to note an update: I now grok what Valentine is trying to say and what he is pointing at in Here’s the Exit and We’re already in AI takeoff. That is, I have a detailed enough model of Valentine’s model of the things he talks about, such that I understand the things he is saying.
I still don’t feel like I understand Kensho. I get the pattern of the epistemic puzzle he is demonstrating, but I don’t know if I get the object-level thing he points at. Based on a reread of the comments, maybe what Valentine means by Looking is essentially gnosis, as opposed to doxa: an understanding grounded in your own experience, rather than an ungrounded one you absorbed from someone else’s claims. See this comment in that post, from someone other than Valentine:
The fundamental issue is that we are communicating in language, the medium of ideas, so it is easy to get stuck in ideas. The only way to get someone to start looking, insofar as that is possible, is to point at things using words, and to get them to do things. This is why I tell you to do things like wave your arms about or attack someone with your personal bubble or try to initiate the action of touching a hot stove element.
Alternately, Valentine describes the process of Looking as “Direct embodied perception prior to thought”:
Most of that isn’t grounded in reality, but that fact is hard to notice because the thinker isn’t distinguishing between thoughts and reality.
Looking is just the skill of looking at reality prior to thought. It’s really not complicated. It’s just very, very easy to misunderstand if you fixate on mentally understanding it instead of doing it. Which sadly seems to be the default response to the idea of Looking.
I am unsure if this differs from mundane metacognitive skills like “notice the inchoate cognitions that arise in your mind-body, that aren’t necessarily verbal”. I assume that Valentine is pointing at a certain class of cognition, one that is essentially entirely free of interpretation. Or perhaps before ‘value-ness’ is attached to an experience—such as “this experience is good because <elaborate strategic chain>” or “this experience is bad because it hurts!”
I understand how a better metacognitive skillset would lead to the benefits Valentine mentioned, but I don’t think it requires you to only stay at the level of “direct embodied perception prior to thought”.
As for kensho, it seems to be a term for some skill that leads you to be able to do what romeostevensit calls ‘fully generalized un-goodharting’:
I may have a better answer for the concrete thing that it allows you to do: it’s fully generalizing the move of un-goodharting. Buddhism seems to be about doing this for happiness/inverse-suffering, though in principle you could pick a different navigational target (maybe).
Concretely, this should show up as being able to decondition induced reward loops and thus not be caught up in any negative compulsive behaviors.
I think that “fully generalized un-goodharting” is a pretty vague phrase and I could probably come up with a better one, but it is an acceptable pointer term for now. So I assume it is something like ‘anti-myopia’? Hard to know at this point. I’d need more experience and experimentation and thought to get a better idea of this.
I believe that Here’s the Exit, We’re already in AI Takeoff, and Slack matters more than any outcome were all pointing at the same cluster of skills and thoughts: recognizing the existence of psyops and of systematic vulnerabilities or issues that lead ‘you’ (whatever ‘you’ means) to forget the ‘bigger picture’, and recognizing that the resulting myopia causes significantly bad outcomes from the perspective of the ‘whole’ individual/society/whatever.
In general, Lexicogenesis seems like a really important sub-skill for deconfusion.
I’ve experimented with Claude Opus for simple Ada autoformalization test cases (specifically quicksort), and it seems like the sort of issues that make LLM agents infeasible (hallucination-based drift, subtle drift caused by sticking to certain implicit assumptions you made before) are also the issues that make Opus hard to use for autoformalization attempts.
I haven’t experimented with a scaffolded LLM agent for autoformalization, but I expect it won’t go very well either, primarily because scaffolding involves trying to turn implicit, human-like high-level cognitive strategies into explicit algorithms or heuristics (such as tree-of-thought prompting), and I expect that this doesn’t scale given the complexity of the domain (sufficiently general autoformalizing AI systems can be modelled as effectively consequentialist, which makes them dangerous). I don’t expect a scaffolded LLM agent (built over Opus) to succeed at autoformalizing quicksort right now either, mostly because I believe RLHF tuning has systematically optimized Opus to write the bottom line first, then attempt to build or hallucinate a viable answer, and then post-hoc justify it. (While steganographic, non-visible chain-of-thought may have gone into figuring out the bottom line, this is still worse than doing visible chain-of-thought first, so that the model has more token-compute-iterations to compute its answer.)
If anyone reading this is able to build a scaffolded agent that autoformalizes (using Lean or Ada) algorithms of complexity equivalent to quicksort reliably (such that more than 5 out of 10 of its attempts succeed) within the next month of me writing this comment, then I’d like to pay you 1000 EUR to see your code and for an hour of your time to talk with you about this. That’s a little less than twice my current usual monthly expenses, for context.
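For concreteness, here is a minimal sketch of the kind of propose-and-verify scaffold I have in mind. It is an illustration of the loop structure only, not a working system: `ask_llm` is a hypothetical wrapper around whatever chat API you use, and it assumes a `lean` executable on your PATH acting as the ground-truth checker, so that hallucinated formalizations fail loudly instead of drifting silently.

```python
import subprocess
import tempfile
from pathlib import Path

MAX_ATTEMPTS = 10

def ask_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with your API client of choice."""
    raise NotImplementedError

def lean_accepts(source: str) -> tuple[bool, str]:
    """Write the candidate formalization to a file and ask Lean to check it."""
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "Candidate.lean"
        path.write_text(source)
        result = subprocess.run(["lean", str(path)], capture_output=True, text=True)
        return result.returncode == 0, result.stdout + result.stderr

def autoformalize(spec: str) -> str | None:
    """Propose-and-verify loop: the LLM proposes, the proof checker disposes."""
    prompt = f"Formalize the following algorithm and prove it correct in Lean 4:\n{spec}"
    feedback = ""
    for _ in range(MAX_ATTEMPTS):
        candidate = ask_llm(prompt + feedback)
        ok, errors = lean_accepts(candidate)
        if ok:
            return candidate
        # Feed the checker's errors back in; this is where hallucination-based
        # drift tends to show up, as the model "fixes" things that were fine.
        feedback = f"\n\nLean rejected the previous attempt:\n{errors}\nPlease fix it."
    return None
```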
This is very interesting, thank you for posting this.
the therapeutic idea of systematically replacing the concept “should” with less normative framings
Interesting. I independently came up with this concept, downstream of thinking about moral cognition and parts work. Could you point me to any past literature that discusses this coherently enough that you would recommend it to someone trying to understand the concept?
I know that Nate has written about this:
As far as I recall, reading these posts didn’t help me.
Based on gwern’s comment, steganography as a capability can arise (at rather rudimentary levels) via RLHF over multi-step problems (which is effectively most cognitive work, really), and this gets exacerbated by the proliferation of AI-generated text that has such steganographic signals embedded in it.
The following paragraph by gwern (from the same thread linked in the previous paragraph) basically summarizes my current thoughts on the feasibility of prevention of steganography for CoT supervision:
Inner-monologue approaches to safety, in the new skin of ‘process supervision’, are popular now so it might be good for me to pull out one point and expand on it: ‘process supervision’ does not necessarily prevent steganography, nor does it ensure safety, because you are still pitting three goals against each other—achieving superhuman capability on a task, achieving causal fidelity of transcripts, achieving human-readability of the transcripts. Choose two: you can have transcripts which record a capable thought process accurately but which you cannot understand in full detail; which are not capable but their transcripts accurately convey the fallible flawed concepts and reasoning used; or which are capable and you understand, but are not what it actually thought (because they are misleading, wrong, or shallow ‘lies to children’ sorts of explanations).
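To make concrete what “steganography” means here, a toy example of my own (not from gwern’s comment): wherever the writer is free to choose between interchangeable wordings, each choice can carry a hidden bit that a reader who only checks whether the text makes sense will never notice.

```python
# Toy illustration: hide bits in synonym choice. The sentence reads as ordinary
# prose; the hidden payload is only visible if you know the encoding convention.

SYNONYM_PAIRS = [("big", "large"), ("quick", "fast"), ("start", "begin")]

def encode(bits: str) -> str:
    """Pick one synonym from each pair according to the hidden bit string."""
    words = [pair[int(bit)] for pair, bit in zip(SYNONYM_PAIRS, bits)]
    return f"The {words[0]} model was {words[1]} to {words[2]} the proof."

def decode(sentence: str) -> str:
    """Recover the hidden bits by checking which synonym was used."""
    tokens = sentence.replace(".", "").split()
    chosen = [w for w in tokens if any(w in pair for pair in SYNONYM_PAIRS)]
    return "".join(str(pair.index(word)) for pair, word in zip(SYNONYM_PAIRS, chosen))

if __name__ == "__main__":
    message = encode("101")
    print(message)          # "The large model was quick to begin the proof."
    print(decode(message))  # "101"
```

A transcript monitor that only checks whether the prose is sensible would pass this text without noticing the extra channel.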
Well, if you know relevant theoretical CS and useful math, you don’t have to rebuild the mathematical scaffolding all by yourself.
I didn’t intend to imply in my message that you have mathematical scaffolding that you are recreating, although I expect that may well be the case (Pearlian causality, perhaps? I’ve been looking into it recently, and clearly knowing Bayes nets is very helpful). I used “you” in the general sense, to say that this is usually the case. I haven’t looked very deeply into the stuff you are doing, unfortunately; it is on my to-do list.
I do think that systematic self-delusion seems useful in multi-agent environments (see the commitment races problem for an abstract argument, and Sarah Constantin’s essay “Is Stupidity Strength?” for a more concrete argument).
I’m not certain that this is the optimal strategy we have for dealing with such environments, and note that systematic self-delusion also leaves you (and the other people using a similar strategy to coordinate) vulnerable to risks that do not take into account your self-delusion. This mainly includes existential risks such as misaligned superintelligences, but also extinction-level asteroids.
It’s a pretty complicated picture and I don’t really have clean models of these things, but I do think that for most contexts I interact in, the long-term upside of having better models of reality is significantly higher than the benefit of systematic self-delusion.
According to Eliezer Yudkowsky, your thoughts should reflect reality.
I expect that the more your beliefs track reality, the better you’ll get at decision making, yes.
According to Paul Graham, the most successful people are slightly overconfident.
Ah, but VCs benefit from being able to average over an ensemble of startup founders! From the perspective of an individual founder, it’s a non-ergodic situation. It’s better to make Kelly bets instead, if you prefer not to fall into gambler’s ruin, given whatever definition of the real-world situation maps onto the abstract concept of being ‘ruined’ here.
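A toy simulation of the point, with made-up numbers (a repeated even-money bet you win with probability 0.6); nothing here models actual startups, it just shows why the ensemble view (the VC’s) and the single-trajectory view (the founder’s) come apart:

```python
import random

P_WIN = 0.6            # assumed win probability per even-money bet
KELLY = 2 * P_WIN - 1  # Kelly fraction for an even-money bet: f* = p - q = 0.2
ROUNDS = 100
TRIALS = 10_000

def final_wealth(fraction: float) -> float:
    """Wealth after ROUNDS bets, staking `fraction` of current wealth each time."""
    wealth = 1.0
    for _ in range(ROUNDS):
        stake = wealth * fraction
        wealth += stake if random.random() < P_WIN else -stake
    return wealth

def median_wealth(fraction: float) -> float:
    """Median outcome over many simulated trajectories (the founder's view)."""
    results = sorted(final_wealth(fraction) for _ in range(TRIALS))
    return results[TRIALS // 2]

def expected_wealth(fraction: float) -> float:
    """Exact ensemble average (the portfolio view): grows whenever the bet has edge."""
    return (1 + fraction * (2 * P_WIN - 1)) ** ROUNDS

if __name__ == "__main__":
    for f in (1.0, KELLY):
        print(f"fraction {f:.1f}: expected {expected_wealth(f):.3g}, "
              f"median {median_wealth(f):.3g}")
    # fraction 1.0: expected ~8.3e7, median 0.0 -> great for the ensemble, ruin for you
    # fraction 0.2: expected ~51,    median ~7.5 -> the typical Kelly trajectory grows
```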
It usually pays to have a better causal model of reality than relying on what X person says to inform your actions.
Can you think of anyone who has changed history who wasn’t a little overconfident?
It is advantageous to be friends with the kind of people who do things and never give up.
I think I do things and never give up in general, while I can be pessimistic about specific things and tasks I could do. You can be generally extremely confident in yourself and your ability to influence reality, while also being specifically pessimistic about a wide range of existing possible things you could be doing.
I wrote a bit about it in this comment.
I think that conceptual alignment research of the sort that Johannes is doing (and that I also am doing, which I call “deconfusion”) is just really difficult. It involves skills that are not taught to people, that you seem very unlikely to learn by being mentored in traditional academia (including when doing theoretical CS or non-applied math PhDs), and that I only started wrapping my head around after some mentorship from two MIRI researchers (which I believe I was pretty lucky to get); even then, I’ve spent a ridiculous amount of time by myself trying to tease out patterns and figure out a more systematic process for doing this.
Oh, and the more theoretical CS (and related math such as mathematical logic) you know, the better you probably are at this—see how Johannes tries to create concrete models of the inchoate concepts in his head? Well, if you know relevant theoretical CS and useful math, you don’t have to rebuild the mathematical scaffolding all by yourself.
I don’t have a good enough model of John Wentworth’s model for alignment research to understand the differences, but I don’t think I learned all that much from John’s writings and his training sessions that were a part of his MATS 4.0 training regimen, as compared to the stuff I described above.
Note that when I said I disagree with your decisions, I specifically meant the sort of myopia in the glass shard story—and specifically because I believe that if your research process / cognition algorithm is fragile enough that you’d be willing to take physical damage to hold onto an inchoate thought, maybe consider making your cognition algorithm more robust.
Quoted from the linked comment:
Rather, I’m confident that executing my research process will over time lead to something good.
Yeah, this is a sentiment I agree with and believe. I think that it makes sense to have a cognitive process that self-corrects and systematically moves towards solving whatever problem it is faced with. In terms of computability theory, one could imagine it as an effectively computable function that you expect will return you the answer—and the only ‘obstacle’ is time / compute invested.
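A minimal sketch of that computability analogy, in illustrative code of my own: if you have a reliable way to check candidate answers, then even the dumbest systematic process (enumerate and test) is guaranteed to find a solution whenever a verifiable one exists; the only cost is time / compute.

```python
from itertools import count

def solve(is_solution, candidates):
    """Return the first candidate accepted by `is_solution`.

    Guaranteed to halt if some candidate in the (possibly infinite) enumeration
    passes the check; otherwise it simply keeps spending compute.
    """
    for candidate in candidates:
        if is_solution(candidate):
            return candidate

if __name__ == "__main__":
    # Illustrative problem: smallest nonnegative integer whose square exceeds 10_000.
    print(solve(lambda n: n * n > 10_000, count(0)))  # prints 101
```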
I think being confident, i.e. not feeling hopeless in doing anything, is important. The important takeaway here is that you don’t need to be confident in any particular idea that you come up with. Instead, you can be confident in the broader picture of what you are doing, i.e. your processes.
I share your sentiment, although the causal model for it is different in my head. A generalized feeling of hopelessness is an indicator of mistaken assumptions and causal models in my head, and I use that as a cue to investigate why I feel that way. This usually results in me having hopelessness about specific paths, and a general purposefulness (for I have an idea of what I want to do next), and this is downstream of updates to my causal model that attempts to track reality as best as possible.
I don’t know whether OpenAI uses nondisparagement agreements; I haven’t signed one.
This can also be glomarizing. “I haven’t signed one” is a fact, offered for the reader to use as anecdotal evidence. “I don’t know whether OpenAI uses nondisparagement agreements” can mean that he doesn’t know for sure, and will not try to find out.
Obviously, the context of the conversation and the events surrounding Holden stating this matters for interpreting this statement, but I’m not interested in looking further into this, so I’m just going to highlight the glomarization possibility.
I think what quila is pointing at is their belief in the supposed fragility of thoughts at the edge of research questions. From that perspective I think their rebuttal is understandable, and your response completely misses the point: you can be someone who spends only four hours a day working and the rest of the time relaxing, but also care a lot about not losing the subtle and supposedly fragile threads of your thought when working.
Note: I have a different model of research thought, one that involves a systematic process towards insight, and because of that I also disagree with Johannes’ decisions.
But the discussion of “repercussions” before there’s been an investigation goes into pure-scapegoating territory if you ask me.
Just to be clear, OP themselves seem to think that what they are saying will have little effect on the status quo. They literally called it a “Very Spicy Take”; the intention was simply to express how they felt about the situation. I’m not sure why you find this threatening, because, again, the people who OP thinks ideally wouldn’t continue to have influence over AI-safety-related decisions are incredibly influential and will very likely continue to have the influence they currently possess. Almost everyone else in this thread implicitly models this fact as they discuss things related to the OP comment.
There is not going to be any scapegoating. I imagine that everything I say is something I would say in person to the people involved, or to third parties, and I don’t expect any sort of coordinated action to reduce their influence; they are that irreplaceable to the community and to the ecosystem.
“Keep people away” sounds like moral talk to me.
Can you not be close friends with someone while also expecting them to be bad at self-control when it comes to alcohol? Or perhaps they are great at technical stuff like research, but pretty bad at negotiation, especially in adversarial situations with experienced counterparties, such as when talking to VCs?
If you think someone’s decisionmaking is actively bad, i.e. you’d be better off reversing any advice from them, then maybe you should keep them around so you can do that!
It is not that people’s decision-making skill is optimized such that you can consistently reverse their opinions to get something that accurately tracks reality. If that were the case, they would already be implicitly tracking reality very well. Reversed stupidity is not intelligence.
But more realistically, someone who’s fucked up in a big way will probably have learned from that, and functional cultures don’t throw away hard-won knowledge.
Again, you don’t seem to be tracking the context of our discussion here. This advice is usually given with respect to junior people embedded in an institution, because the ability to blame someone and/or hold them responsible is a power that senior/executive people hold. The attitude you describe makes a lot of sense when it comes to people who are learning things, yes. I don’t know if you can carry it over wholesale into this domain, and you even acknowledge this in the next few lines.
Imagine a world where AI is just an inherently treacherous domain, and we throw out the leadership whenever they make a mistake.
I think it is incredibly unlikely that the rationalist community has an ability to ‘throw out’ the ‘leadership’ involved here. I find this notion incredibly silly, given the amount of influence OpenPhil has over the alignment community, especially through their funding (including the pipeline, such as MATS).
I downvoted this comment because it felt uncomfortably scapegoat-y to me.
If you start with the assumption that there was a moral failing on the part of the grantmakers, and you are wrong, there’s a good chance you’ll never learn that.
I think you are misinterpreting the grandparent comment. I do not read any mention of a ‘moral failing’ in that comment. You seem worried because of the commenter’s clear description of what they think would be a sensible step for us to take given what they believe are egregious flaws in the decision-making processes of the people involved. I don’t think there’s anything wrong with such claims.
Again: You can care about people while also seeing their flaws and noticing how they are hurting you and others you care about. You can be empathetic to people having flawed decision making and care about them, while also wanting to keep them away from certain decision-making positions.
If you think the OpenAI grant was a big mistake, it’s important to have a detailed investigation of what went wrong, and that sort of detailed investigation is most likely to succeed if you have cooperation from people who are involved.
Oh, interesting. Who exactly do you think influential people like Holden Karnofsky and Paul Christiano are accountable to? This “detailed investigation” you speak of, and this notion of a “blameless culture”, make a lot of sense when you are the head of an organization and you are conducting an investigation into systematic mistakes made by people who work for you, and who you are responsible for. I don’t think this situation is similar enough that you can apply these intuitions wholesale, without thinking through the actual causal factors involved here.
Note that I don’t necessarily endorse the claims in the grandparent comment. This is a complex situation, and I’d want to spend more time analyzing it and what occurred.
“ETA” is commonly short for “estimated time of arrival”. I understand you are using it to mean “edited”, but I don’t quite know what it is short for in that usage, and it seems like using it this way is just confusing for people in general.
These are pretty sane takes (conditional on my model of Thomas Kwa of course), and I don’t understand why people have downvoted this comment. Here’s an attempt to unravel my thoughts and potential disagreements with your claims.
I think safety work gets less and less valuable at crunch time actually. I think you have this Paul Christiano-like model of getting a prototypical AGI and dissecting it and figuring out how it works—I think it is unlikely that any individual frontier lab would perceive itself to have the slack to do so. Any potential “dissection” tools will need to be developed beforehand, such as scalable interpretability tools (SAEs seem like rudimentary examples of this). The problem with “prosaic alignment” IMO is that a lot of this relies on a significant amount of schlep—a lot of empirical work, a lot of fucking around. That’s probably why, according to the MATS team, frontier labs have a high demand for “iterators”—their strategy involves having a lot of ideas about stuff that might work, and without a theoretical framework underlying their search path, a lot of things they do would look like trying things out.
I expect that once you get AI-researcher-level systems, the die is cast. Whatever prosaic alignment and control measures you’ve figured out, you’ll now be using them in an attempt to play this breakneck game of getting useful work out of a potentially misaligned AI ecosystem, one that is also modifying itself to improve its capabilities (because that is the point of AI researchers). (Sure, it’s easier to test for capability improvements. That doesn’t mean information can’t be embedded into these proposals such that the modified models end up changed in ways the humans did not anticipate, or would not want if they had a full understanding of what is going on.)
Yeah—I think most “random AI jobs” are significantly worse for trying to do useful work in comparison to just doing things by yourself or with some other independent ML researchers. If you aren’t in a position to do this, however, it does make sense to optimize for a convenient low-cognitive-effort set of tasks that provides you the social, financial and/or structural support that will benefit you, and perhaps look into AI safety stuff as a hobby.
I agree that mentorship is a fundamental bottleneck to building mature alignment researchers. This is unfortunate, but it is the reality we have.
Yeah, post-FTX, I believe that funding is limited enough that you have to be consciously optimizing for getting funding (as an EA-affiliated organization, or as an independent alignment researcher). Particularly for new conceptual alignment researchers, I expect that funding is drastically limited since funding organizations seem to explicitly prioritize funding grantees who will work on OpenPhil-endorsed (or to a certain extent, existing but not necessarily OpenPhil-endorsed) agendas. This includes stuff like evals.
This is a very Paul Christiano-like argument—yeah sure the math makes sense, but I feel averse to agreeing with this because it seems like you may be abstracting away significant parts of reality and throwing away valuable information we already have.
Anyway, yeah I agree with your sentiment. It seems fine to work on non-SOTA AI / ML / LLM stuff and I’d want people to do so such that they live a good life. I’d rather they didn’t throw themselves into the gauntlet of “AI safety” and get chewed up and spit out by an incompetent ecosystem.
I still don’t understand what causal model would produce this prediction. Here’s mine: one big limit on the number of safety researchers the current SOTA lab ecosystem can absorb is the labs’ expectations for how many researchers they want or need. On one hand, more schlep during the pre-AI-researcher era means more hires. On the other hand, more hires require more research managers or managerial experience. Anecdotally, many AI capabilities and alignment organizations (both in the EA space and in the frontier lab space) seem to have historically been bottlenecked on management capacity. Additionally, hiring has a cost (both the search process and the onboarding), and it is likely that as labs get closer to creating AI researchers, they’ll believe that the opportunity cost of hiring continues to increase.
Nah, I found that very little from my vision model research work (during my undergrad) contributed to my skill and intuition for language model research work (again during my undergrad, both around 2021-2022). I mean, specific skills like programming, using PyTorch, debugging model issues, data processing, and containerization, sure, but the opportunity cost is ridiculous when you could be working with LLMs directly and reading papers relevant to the game you want to play. High-quality cognitive work is extremely valuable, and spending it on irrelevant things like the specifics of diffusion models (for example) seems quite wasteful unless you really think this stuff is relevant.
Yeah, this makes sense for extreme newcomers. If someone can get a capabilities job, however, I think they are doing themselves a disservice by playing the easier game of capabilities work. Yes, capabilities work has better feedback loops than alignment research / implementation work. But that’s like saying “search for your keys under the streetlight because that’s where you can see the ground most clearly.” I’d want these people to start building the epistemological skills to thrive even with a lower intensity of feedback loops, such that they can do alignment research work effectively.
And the best way to do that is to actually attempt to do alignment research, if you are in a position to do so.