Claude learns across different chats. What does this mean?
I was asking Claude 3 Sonnet “what is a PPU” in the context of this thread. For that purpose, I pasted part of the thread.
Claude automatically assumed that OA meant Anthropic (instead of OpenAI), which was surprising.
I opened a new chat, copying the exact same text, but with OA replaced by GDM. Even then, Claude assumed GDM meant Anthropic (instead of Google DeepMind).
This seemed like interesting behavior, so I started toying around (in new chats) with more tweaks to the prompt to check its robustness. But from then on Claude always correctly assumed OA was OpenAI, and GDM was Google DeepMind.
In fact, even when copying the exact same original prompt (which elicited Claude to take OA to be Anthropic) into a new chat, the mistake no longer happened. Neither when I went for a lot of retries, nor when I tried the same thing in many different new chats.
Does this mean Claude somehow learns across different chats (inside the same user account)? If so, this might not happen through a process as naive as “append previous chats as the start of the prompt, with a certain indicator that they are different”, but instead some more effective distillation of the important information from those chats. Do we have any information on whether and how this happens?
(A different hypothesis is not that the later queries had access to the information from the previous ones, but rather that they were for some reason “more intelligent” and were able to catch up to the real meanings of OA and GDM, where the previous queries were not. This seems way less likely.)
I’ve checked for cross-chat memory explicitly (telling it to remember some information in one chat, and asking about it in another), and it acts as if it doesn’t have it. Claude also explicitly states it doesn’t have cross-chat memory when asked. Might something be happening like “it does have some cross-chat memory, but it’s told not to acknowledge this fact, and it sometimes slips”?
Probably more nuanced experiments are in order. Although note that this might only happen in the chat webapp, and not when accessing the model through the API.
What are your timelines like? How long do YOU think we have left?
I know several CEOs of small AGI startups who seem to have gone crazy and told me that they are self inserts into this world, which is a simulation of their original self’s creation. However, none of them talk about each other, and presumably at most one of them can be meaningfully right?
One AGI CEO hasn’t gone THAT crazy (yet), but is quite sure that the November 2024 election will be meaningless because pivotal acts will have already occurred that make nation state elections visibly pointless.
Also I know many normies who can’t really think probabilistically and mostly aren’t worried at all about any of this… but one normie who can calculate is pretty sure that we have AT LEAST 12 years (possibly because his retirement plans won’t be finalized until then). He also thinks that even systems as “mere” as TikTok will be banned before the November 2024 election because “elites aren’t stupid”.
I think I’m more likely to be better calibrated than any of these opinions, because most of them don’t seem to focus very much on “hedging” or “thoughtful doubting”, whereas my event space assigns non-zero probability to ensembles that contain such features of possible futures (including these specific scenarios).
Wondering why this has so many disagreement votes. Perhaps people don’t like to see the serious topic of “how much time do we have left” alongside evidence that there’s a population of AI entrepreneurs who are so far removed from consensus reality that they now think they’re living in a simulation.
I assume timelines are fairly long or this isn’t safety related. I don’t see a point in keeping PPUs or even caring about NDA lawsuits which may or may not happen and would take years in a short timeline or doomed world.
I think having a probability distribution over timelines is the correct approach. Like, in the comment above:
I think I’m more likely to be better calibrated than any of these opinions, because most of them don’t seem to focus very much on “hedging” or “thoughtful doubting”, whereas my event space assigns non-zero probability to ensembles that contain such features of possible futures (including these specific scenarios).
Even in probabilistic terms, the evidence of OpenAI members respecting their NDAs makes it more likely that this was some sort of political infighting (EA related) than sub-year takeoff timelines. I would be open to a 1 year takeoff, I just don’t see it happening given the evidence. OpenAI wouldn’t need to talk about raising trillions of dollars, companies wouldn’t be trying to commoditize their products, and the employees who quit OpenAI would speak up.
Political infighting is in general just more likely than very short timelines, which would run counter to most prediction markets on the matter. Not to mention, given it’s already happened with the firing of Sam Altman, it’s far more likely to have happened again.
If there was a probability distribution of timelines, the current events indicate sub 3 year ones have negligible odds. If I am wrong about this, I implore the OpenAI employees to speak up. I don’t think normies misunderstand probability distributions, they just usually tend not to care about unlikely events.
No, OpenAI (assuming that it is a well-defined entity) also uses a probability distribution over timelines.
(In reality, every member of its leadership has their own probability distribution, and this translates to OpenAI having a policy and behavior formulated approximately as if there is some resulting single probability distribution).
The important thing is, they are uncertain about timelines themselves (in part, because no one knows how perplexity translates to capabilities, in part, because there might be difference with respect to capabilities even with the same perplexity, if the underlying architectures are different (e.g. in-context learning might depend on architecture even with fixed perplexity, and we do see a stream of potentially very interesting architectural innovations recently), in part, because it’s not clear how big is the potential of “harness”/”scaffolding”, and so on).
This does not mean there is no political infighting. But it’s on the background of them being correctly uncertain about true timelines...
Compute-wise, inference demands are huge and growing with popularity of the models (look how much Facebook did to make LLama 3 more inference-efficient).
So if they expect models to become useful enough for almost everyone to want to use them, they should worry about compute, assuming they do want to serve people like they say they do (I am not sure how this looks for very strong AI systems; they will probably be gradually expanding access, and the speed of expansion might vary).
However, none of them talk about each other, and presumably at most one of them can be meaningfully right?
Why can at most one of them be meaningfully right?
Would not a simulation typically be “a multi-player game”?
(But yes, if they assume that their “original self” was the sole creator (?), then they would all be some kind of “clones” of that particular “original self”. Which would surely increase the overall weirdness.)
Thank you for your work there. Curious what specifically prompted you to post this now, presumably you leaving OpenAI and wanting to communicate that somehow?
Can you confirm or deny whether you signed any NDA related to you leaving OpenAI?
(I would guess a “no comment”, a lack of response, or something to that degree implies a “yes” with reasonably high probability. Also, when deciding how to respond here, you might be interested in this link: the National Labor Relations Board has ruled that NDAs offered during severance agreements which cover the existence of the NDA itself are unlawful.)
My layman’s understanding is that managerial employees are excluded from that ruling, unfortunately. Which I think applies to William_S if I read his comment correctly. (See Pg 11, in the “Excluded” section in the linked pdf in your link)
I think it is safe to infer from the conspicuous and repeated silence by ex-OA employees, when asked whether they signed an NDA which also included a gag order about the NDA, that there is in fact an NDA with a gag order in it, presumably tied to the OA LLC PPUs (which are not real equity and so probably even less protected than usual).
Does anyone know if it’s typically the case that people under gag orders about their NDAs can talk to other people who they know signed the same NDAs? That is, if a bunch of people quit a company and all have signed self-silencing NDAs, are they normally allowed to talk to each other about why they quit and commiserate about the costs of their silence?
You should update by +-1% on AI doom surprisingly frequently
This is just a fact about how stochastic processes work. If your p(doom) is Brownian motion in 1% steps starting at 50% and stopping once it reaches 0 or 1, then there will be about 50^2=2500 steps of size 1%. This is a lot! If we get all the evidence for whether humanity survives or not uniformly over the next 10 years, then you should make a 1% update 4-5 times per week. In practice there won’t be as many due to heavy-tailedness in the distribution and the fact you don’t start at 50%. But I do believe that evidence is coming in every week such that ideal market prices should move by 1% on maybe half of weeks, and it is not crazy for your probabilities to shift by 1% during many weeks if you think about it.
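A quick simulation of that claim, treating p(doom) as the idealized symmetric walk described above (a sketch of the idealization, not a model of how evidence actually arrives):

```python
# Symmetric random walk on p(doom) in 1% steps, starting at 50%, absorbed at 0% or 100%.
# The theoretical expected number of steps is 50*50 = 2500.
import random

def steps_until_absorbed(start=50, lo=0, hi=100):
    p, steps = start, 0
    while lo < p < hi:
        p += random.choice((-1, 1))
        steps += 1
    return steps

runs = [steps_until_absorbed() for _ in range(2_000)]
avg = sum(runs) / len(runs)
print(f"average number of 1% updates: {avg:.0f}")                 # ~2500
print(f"updates per week over 10 years: {avg / (10 * 52):.1f}")   # ~4.8
```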
Does anyone have any takes on the two Boeing whistleblowers who died under somewhat suspicious circumstances? I haven’t followed this in detail, and my guess is it is basically just random chance, but it sure would be a huge deal if a publicly traded company now was performing assassinations of U.S. citizens.
Curious whether anyone has looked into this, or has thought much about baseline risk of assassinations or other forms of violence from economic actors.
How many people would, if they suddenly died, be reported as a “Boeing whistleblower”? The lower this number is, the more surprising the death.
Another HN commenter says (in a different thread):
It’s a nice little math problem.
Let’s say both of the whistleblowers were age 50. The probability of a 50 year old man dying in a year is 0.6%. So the probability of 2 or more of them dying in a year is 1 - (the probability of exactly zero dying in a year + the probability of exactly one dying in a year). 1 - (A+B).
A is (1-0.006)^N. B is 0.006N(1-0.006)^(N-1). With N = 60, A is about 70% and B is about 25%, making it statistically insignificant.
But they died in the same 2 month period, so that 0.006 should be 0.001. If you rerun the same calculation, it’s 356.
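A minimal sketch of the quoted calculation, for anyone who wants to rerun it (N = 60 and the death rates are the commenter's assumptions, not mine):

```python
# Probability that 2 or more of N "potential Boeing whistleblowers" die in a window,
# given a per-person death probability p for that window (the quoted HN arithmetic).
def prob_two_or_more(n, p):
    a = (1 - p) ** n                   # exactly zero deaths
    b = n * p * (1 - p) ** (n - 1)     # exactly one death
    return 1 - (a + b)

print(prob_two_or_more(60, 0.006))   # ~0.05   (annual death rate for a 50-year-old)
print(prob_two_or_more(60, 0.001))   # ~0.0017 (rate scaled down to a ~2-month window)
```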
Thoughtdump on why I’m interested in computational mechanics:
one concrete application to natural abstractions from here: tl;dr, belief structures generally seem to be fractal shaped. one major part of natural abstractions is trying to find the correspondence between structures in the environment and concepts used by the mind. so if we can do the inverse of what adam and paul did, i.e. ‘discover’ fractal structures from activations and figure out what stochastic process they might correspond to in the environment, that would be cool
… but i was initially interested in reading compmech stuff not with a particular alignment relevant thread in mind but rather because it seemed broadly similar in directions to natural abstractions.
re: how my focus would differ from my impression of current compmech work done in academia: academia seems faaaaaar less focused on actually trying out epsilon reconstruction in real world noisy data. CSSR is an example of a reconstruction algorithm. apparently people did compmech stuff on real-world data, don’t know how good, but effort-wise far less has been invested compared to theory work
would be interested in these reconstruction algorithms, eg what are the bottlenecks to scaling them up, etc.
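as a concrete toy of what reconstruction means here (not CSSR itself, just its core move of grouping histories by their predictive distributions, run on the textbook golden mean process where a 1 is never followed by another 1):

```python
# Toy epsilon-machine-style reconstruction: generate the golden mean process
# (no two 1s in a row), then group length-2 histories by the empirical
# distribution of the next symbol. CSSR does this adaptively with statistical
# tests; this sketch just hardcodes the history length and rounds.
import random
from collections import defaultdict

def golden_mean(n, seed=0):
    random.seed(seed)
    out, prev = [], 0
    for _ in range(n):
        x = 0 if prev == 1 else random.randint(0, 1)
        out.append(x)
        prev = x
    return out

seq = golden_mean(200_000)
counts = defaultdict(lambda: [0, 0])          # history -> [#next=0, #next=1]
for i in range(2, len(seq)):
    counts[tuple(seq[i - 2:i])][seq[i]] += 1

states = defaultdict(list)                    # rounded P(next=1 | history) -> histories
for hist, (n0, n1) in counts.items():
    states[round(n1 / (n0 + n1), 1)].append(hist)

for p1, hists in sorted(states.items()):
    print(f"P(next=1) ~ {p1}: histories {hists}")
# Expect two groups: histories ending in 1 (P(next=1)=0) and histories ending
# in 0 (P(next=1)~0.5) -- the two causal states of the golden mean process.
```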
tangent: epsilon transducers seem cool. if the reconstruction algorithm is good, a prototypical example i’m thinking of is something like: pick some input-output region within a model, and literally try to discover the hmm model reconstructing it? of course it’s gonna be unwieldy. but, to shift the thread in the direction of bright-eyed theorizing …
the foundational Calculi of Emergence paper talked about the possibility of hierarchical epsilon machines, where you do epsilon machines on top of epsilon machines and for simple examples where you can analytically do this, you get wild things like coming up with more and more compact representations of stochastic processes (eg data stream → tree → markov model → stack automata → … ?)
this … sounds like natural abstractions in its wildest dreams? literally point at some raw datastream and automatically build hierarchical abstractions that get more compact as you go up
haha but alas, (almost) no development afaik since the original paper. seems cool
and also more tangentially, compmech seemed to have a lot to talk about providing interesting semantics to various information measures aka True Names, so another angle i was interested in was to learn about them.
eg crutchfield talks a lot about developing a right notion of information flow—obvious usefulness in eg formalizing boundaries?
many other information measures from compmech with suggestive semantics—cryptic order? gauge information? synchronization order? check ruro1 and ruro2 for more.
Epsilon machine (and MSP) construction is most likely computationally intractable [I don’t know an exact statement of such a result in the literature but I suspect it is true] for realistic scenarios.
Scaling an approximate version of epsilon reconstruction therefore seems of prime importance. Real-world architectures and data have highly specific structure & symmetry that make them different from completely generic HMMs. This most likely must be exploited.
The calculi of emergence paper has inspired many people but has not been developed much. Many of the details are somewhat obscure, vague. I also believe that most likely completely different methods are needed to push the program further. Computational Mechanics is primarily a theory of hidden Markov models; it doesn’t have the tools to easily describe behaviour higher up the Chomsky hierarchy. I suspect more powerful and sophisticated algebraic, logical and categorical thinking will be needed here. I caveat this by saying that Paul Riechers has pointed out that one can actually understand all these gadgets up the Chomsky hierarchy as infinite HMMs, which may be analyzed usefully just as finite HMMs are.
The still-underdeveloped theory of epsilon transducers I regard as the most promising lens on agent foundations. This is uncharted territory; I suspect the largest impact of computational mechanics will come from this direction.
Your point on True Names is well-taken. More basic examples than gauge information or synchronization order are the triple of quantities: entropy rate h, excess entropy E, and Crutchfield’s statistical/forecasting complexity C. These are the most important quantities to understand for any stochastic process (such as the structure of language and LLMs!)
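For reference, the standard definitions as I understand them, for a stationary process with block entropy $H(L) = H[X_1 \ldots X_L]$:

$$h_\mu = \lim_{L \to \infty} \frac{H(L)}{L}, \qquad \mathbf{E} = I[\text{past}\,;\,\text{future}] = \lim_{L \to \infty}\big(H(L) - h_\mu L\big), \qquad C_\mu = H[\mathcal{S}],$$

where $\mathcal{S}$ is the random variable over causal states (histories grouped by their conditional distribution over futures), and $\mathbf{E} \le C_\mu$.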
The common saying, “Anything worth doing is worth doing [well/poorly]” needs more qualifiers. As it is, the opposite respective advice can often be just as useful. I.e., not very.
Better V1: “The cost/utility ratio of beneficial actions at minimum cost is often less favorable than it would be with greater investment.”
Better V2: “If an action is beneficial, a flawed attempt may be preferable to none at all.”
However, these are too wordy to be pithy and in pop culture transmission accuracy is generally sacrificed in favor of catchiness.
[epistemic status: I think I’m mostly right about the main thrust here, but probably some of the specific arguments below are wrong. In the following, I’m much more stating conclusions than providing full arguments. This claim isn’t particularly original to me.]
I’m interested in the following subset of risk from AI:
Early: That comes from AIs that are just powerful enough to be extremely useful and dangerous-by-default (i.e. these AIs aren’t wildly superhuman).
Scheming: Risk associated with loss of control to AIs that arises from AIs scheming
So e.g. I exclude state actors stealing weights in ways that aren’t enabled by the AIs scheming, and I also exclude non-scheming failure modes. IMO, state actors stealing weights is a serious threat, but non-scheming failure modes aren’t (at this level of capability and dignity).
Medium dignity: that is, developers of these AIs are putting a reasonable amount of effort into preventing catastrophic outcomes from their AIs (perhaps they’re spending the equivalent of 10% of their budget on cost-effective measures to prevent catastrophes).
Nearcasted: no substantial fundamental progress on AI safety techniques, no substantial changes in how AI works.
This subset of risk is interesting because I think it’s a natural scenario at which to target technical work on AI safety. (E.g. it’s the main scenario we’re targeting with our AI control agenda.)
I claim that the majority of this risk comes from the lab using AIs internally to do AI development (by which I mean both research and engineering). This is because the AIs doing AI development naturally require access to compute and model weights that they can potentially leverage into causing catastrophic outcomes—in particular, those resources can be abused to run AIs unmonitored.
Using AIs for AI development looks uniquely risky to me among applications of early-transformative AIs, because unlike all other applications I know about:
It’s very expensive to refrain from using AIs for this application.
There’s no simple way to remove affordances from the AI such that it’s very hard for the AI to take a small sequence of actions which plausibly lead quickly to loss of control. In contrast, most other applications of AI probably can be controlled just by restricting their affordances.
If I’m right that the risk from scheming early-transformative models is concentrated onto this pretty specific scenario, it implies a bunch of things:
It implies that work on mitigating these risks should focus on this very specific setting.
It implies that AI control is organizationally simpler, because most applications can be made trivially controlled.
It is pretty plausible to me that AI control is quite easy, because you actually can remove affordances from the AIs that are doing AI R&D such that it’s hard for them to cause problems.
Early: That comes from AIs that are just powerful enough to be extremely useful and dangerous-by-default (i.e. these AIs aren’t wildly superhuman).
Can you be clearer about this point? To operationalize this, I propose the following question: what is the fraction of world GDP you expect will be attributable to AI at the time we have these risky AIs that you are interested in?
For example, are you worried about AIs that will arise when AI is 1-10% of the economy, or more like 50%? 90%?
One operationalization is “these AIs are capable of speeding up ML R&D by 30x with less than a 2x increase in marginal costs”.
As in, if you have a team doing ML research, you can make them 30x faster with only <2x increase in cost by going from not using your powerful AIs to using them.
With these caveats:
The speed up is relative to the current status quo as of GPT-4.
The speed up is ignoring the “speed up” of “having better experiments to do due to access to better models” (so e.g., they would complete a fixed research task faster).
By “capable” of speeding things up this much, I mean that if AIs “wanted” to speed up this task and if we didn’t have any safety precautions slowing things down, we could get these speedups. (Of course, AIs might actively and successfully slow down certain types of research and we might have burdensome safety precautions.)
The 2x increase in marginal cost is ignoring potential inflation in the cost of compute (FLOP/$) and inflation in the cost of wages of ML researchers. Otherwise, I’m uncertain how exactly to model the situation. Maybe increase in wages and decrease in FLOP/$ cancel out? Idk.
It might be important that the speed up is amortized over a longer duration like 6 months to 1 year.
I’m uncertain what the economic impact of such systems will look like. I could imagine either massive (GDP has already grown >4x due to the total effects of AI) or only moderate (AIs haven’t yet been that widely deployed due to inference availability issues, so actual production hasn’t increased that much due to AI (<10%), though markets are pricing in AI being a really, really big deal).
So, it’s hard for me to predict the immediate impact on world GDP. After adaptation and broad deployment, systems of this level would likely have a massive effect on GDP.
It is pretty plausible to me that AI control is quite easy
I think it depends on how you’re defining an “AI control success”. If success is defined as “we have an early transformative system that does not instantly kill us, and we are able to get some value out of it”, then I agree that this seems relatively easy under the assumptions you articulated.
If success is defined as “we have an early transformative system that does not instantly kill us and we have enough time, caution, and organizational adequacy to use that system in ways that get us out of an acute risk period”, then this seems much harder.
The classic race dynamic threat model seems relevant here: Suppose Lab A implements good control techniques on GPT-8, and then it’s trying very hard to get good alignment techniques out of GPT-8 to align a successor GPT-9. However, Lab B was only ~2 months behind, so Lab A feels like it needs to figure all of this out within 2 months. Lab B, either because it’s less cautious or because it feels like it needs to cut corners to catch up, either doesn’t want to implement the control techniques, or is fine implementing them but plans to be less cautious about when it’s ready to scale up to GPT-9.
I think it’s fine to say “the control agenda is valuable even if it doesn’t solve the whole problem, and yes other things will be needed to address race dynamics otherwise you will only be able to control GPT-8 for a small window of time before you are forced to scale up prematurely or hope that your competitor doesn’t cause a catastrophe.” But this has a different vibe than “AI control is quite easy”, even if that statement is technically correct.
(Also, please do point out if there’s some way in which the control agenda “solves” or circumvents this threat model– apologies if you or Ryan has written/spoken about it somewhere that I missed.)
When I said “AI control is easy”, I meant “AI control mitigates most risk arising from human-ish-level schemers directly causing catastrophes”; I wasn’t trying to comment more generally. I agree with your concern.
It implies that AI control is organizationally simpler, because most applications can be made trivially controlled.
I didn’t get this from the premises fwiw. Are you saying it’s trivial because “just don’t use your AI to help you design AI” (seems organizationally hard to me), or did you have particular tricks in mind?
Pain is the consequence of a perceived reduction in the probability that an agent will achieve its goals.
In biological organisms, physical pain [say, in response to limb being removed] is an evolutionary consequence of the fact that organisms with the capacity to feel physical pain avoided situations where their long-term goals [e.g. locomotion to a favourable position with the limb] which required the subsystem generating pain were harmed.
This definition applies equally to mental pain [say, the pain felt when being expelled from a group of allies] which impedes long term goals.
This suggests that any system that possesses both a set of goals and the capacity to understand how events influence its probability of achieving such goals should possess a capacity to feel pain. This also suggests that the amount of pain is proportional to the degree of “setbacks” and the degree to which “setbacks” are perceived.
I think this is a relatively robust argument for the inherent reality of pain not just in a broad spectrum of biological organisms, but also in synthetic [including sufficiently advanced AI] agents.
We should strive to reduce the pain we cause in the agents we interact with.
I think pain is a little bit different than that. It’s the contrast between the current state and the goal state. This contrast motivates the agent to act, when the pain of the contrast becomes bigger than the (predicted) pain of acting.
As a human, you can decrease your pain by thinking that everything will be okay, or you can increase your pain by doubting the process. But it is unlikely that you will allow yourself to stop hurting, because your brain fears that a lack of suffering would result in a lack of progress (some wise people contest this, claiming that wu wei is correct).
Another way you can increase your pain is by focusing more on the goal you want to achieve, sort of irritating/torturing yourself with the fact that the goal isn’t achieved, to which your brain will respond by increasing the pain felt by the contrast, urging action.
Do you see how this differs slightly from your definition? Chronic pain is not a continuous reduction in agency, but a continuous contrast between a bad state and a good state, which makes one feel pain which motivates them to solve it (exercise, surgery, resting, looking for painkillers, etc). This generalizes to other negative feelings, for instance to hunger, which exists with the purpose to be less pleasant than the search for food is, such that you seek food.
I warn you that avoiding negative emotions can lead to stagnation, since suffering leads to growth (unless we start wireheading, and making the avoidance of pain our new goal, because then we might seek hedonic pleasures and intoxicants)
I would certainly agree with part of what you are saying. Especially the point that many important lessons are taught by pain [correct me if this is misinterpreting your comment]. Indeed, as a parent for example, if your goal is for your child to gain the capacity for self sufficiency, a certain amount of painful lessons that reflect the inherent properties of the world are necessary to achieve such a goal.
On the other hand, I do not agree with your framing of pain as being the main motivator [again, correct me if required]. In fact, a wide variety of systems in the brain are concerned with calculating and granting rewards. Perhaps pain and pleasure are the two sides of the same coin, and reward maximisation and regret minimisation are identical. In practice however, I think they often lead to different solutions.
I also do not agree with your interpretation that chronic pain does not reduce agency. For family members of mine suffering from arthritis, their chronic pain renders them unable to do many basic activities, for example, access areas for which you would need to climb stairs. I would like to emphasise that it is not the disease which limits their “degrees of freedom” [at least in the short term], and were they to take a large amount of painkillers, they could temporarily climb stairs again.
Finally, I would suggest that your framing as a “contrast between the current state and the goal state” is basically an alternative way of talking about the transition probability from the current state to the goal state. In my opinion, this suggests that our conceptualisations of pain are overwhelmingly similar.
I think all criticism, all shaming, all guilt tripping, all punishments and rewards directed at children—is for the purpose of driving them to do certain actions. If your children do what you think is right, there’s no need to do much of anything.
A more general and correct statement would be “Pain is for the sake of change, and all change is painful”. But that change is for the sake of actions. I don’t think that’s too much of a simplification to be useful.
I think regret, too, is connected here. And there are certainly times when it seems like pain is the problem rather than an attempt to solve it, but I think that’s a misunderstanding. And while chronic pain does reduce agency, it’s a constant pain and a constant reduction of agency (not cumulative). The pain persists until the problem is solved, even if the problem does not get worse. So if the body is telling the brain “Hey, do something about this, the importance is 50 units of pain”, then you will do anything to solve it as long as there’s a path with less than 50 units of pain which leads to a solution.
The pain does limit agency, but not because it’s a real limitation. It’s an artificial one that the body creates to prevent you from damaging yourself. So all important agency is still possible. If the estimated consequences of avoiding the task are more painful than doing the task, you do it. But again, it’s just the body estimating the cost/benefit of tasks and choosing the optimal action by making it the least painful action.
My explanation and yours are almost identical, but there’s some important differences. In my view, suffering is good, not bad. I really don’t want humanity to misunderstand this one fact, it has already had profound negative consequences. It’s phantom damage created to avoid real damage. An agent which is unable to feel physical pain and exhaustion would destroy itself, therefore physical pain and exhaustion are valuable and not problems to be solved. Emotions like suffering, exhaustion, annoyance, etc. function the same as physical pain, and once they get over a certain threshold they coerce you into taking an action. Physical pain comes from nerves, but emotional pain comes from your interpretation of reality. Your brain relies on you to tell what ought to be painful (so if you overestimate risk, it just believes you). And you don’t get to choose all your goals yourself, your brain wants you to fulfill your needs (prioritized by the hierarchy of needs). In short, the brain makes inaction painful, while keeping actions that it deems risky painful, and then messes with the weights/thresholds according to need. Just like with hunger (not eating is painful, but if all you have is stale or even moldy bread, then you need to be very hungry before you eat, and you will eat iff pain(hunger)>pain(eating the bread)).
An increase in power/agency feels a lot like happiness though, even according to Nietzsche who I’m not confident to argue against, so I get why you’d basically think that opposite of happiness is the opposite of agency (sorry if this summary does injustice to your point)
In biological organisms, physical pain [say, in response to limb being removed] is an evolutionary consequence of the fact that organisms with the capacity to feel physical pain avoided situations where their long-term goals [e.g. locomotion to a favourable position with the limb] which required the subsystem generating pain were harmed.
How many organisms other than humans have “long term goals”? Doesn’t that require a complex capacity for mental representation of possible future states?
Am I wrong in assuming that the capacity to experience “pain” is independent of an explicit awareness of what possibilities have been shifted as a result of the new sensory data? (i.e. having a limb cleaved from the rest of the body, stubbing your toe in the dark). The organism may not even be aware of those possibilities, only ‘aware’ of pain.
Note: I’m probably just having a fear of this sounding all too teleological and personifying evolution
Check my math: how does Enovid compare to to humming?
Nitric Oxide is an antimicrobial and immune booster. Normal nasal nitric oxide is 0.14ppm for women and 0.18ppm for men (sinus levels are 100x higher). journals.sagepub.com/doi/pdf/10.117…
Enovid is a nasal spray that produces NO. I had the damndest time quantifying Enovid, but this trial registration says 0.11ppm NO/hour. They deliver every 8h and I think that dose is amortized, so the true dose is 0.88. But maybe it’s more complicated. I’ve got an email out to the PI but am not hopeful about a response clinicaltrials.gov/study/NCT05109…
so Enovid increases nasal NO levels somewhere between 75% and 600% compared to baseline, not shabby. Except humming increases nasal NO levels by 1500-2000%. atsjournals.org/doi/pdf/10.116….
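Spelling out that arithmetic (assumptions: the delivered NO adds on top of the 0.14-0.18 ppm baseline, and the right dose figure is somewhere between the per-dose 0.11 ppm and my amortized 0.88 ppm guess):

```python
# Percent increase of nasal NO over baseline, using the figures quoted above.
baselines = (0.14, 0.18)   # ppm, women / men
doses = (0.11, 0.88)       # ppm: single-dose figure vs amortized-over-8h guess

for dose in doses:
    for base in baselines:
        # treating the delivered NO as additional to baseline
        print(f"dose {dose} ppm over baseline {base} ppm: +{dose / base:.0%}")
```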
Enovid stings and humming doesn’t, so it seems like Enovid should have the larger dose. But the spray doesn’t contain NO itself, but compounds that react to form NO. Maybe that’s where the sting comes from? Cystic fibrosis and burn patients are sometimes given stratospheric levels of NO for hours or days; if the burn from Enovid came from the NO itself, then those patients would be in agony.
I’m not finding any data on humming and respiratory infections. Google scholar gives me information on CF and COPD, @Elicit brought me a bunch of studies about honey.
With better keywords, google scholar brought me a bunch of descriptions of yogic breathing with no empirical backing.
There are some very circumstantial studies on illness in mouth breathers vs. nasal, but that design has too many confounders for me to take seriously.
Where I’m most likely wrong:
misinterpreted the dosage in the RCT
dosage in RCT is lower than in Enovid
Enovid’s dose per spray is 0.5ml, so pretty close to the new study. But it recommends two sprays per nostril, so real dose is 2x that. Which is still not quite as powerful as a single hum.
I found the gotcha: Enovid has two other mechanisms of action. Someone pointed this out to me on my previous nitric oxide post, but it didn’t quite sink in till I did more reading.
I think that’s their guess but they don’t directly check here.
I also suspect that it doesn’t matter very much.
The sinuses have so much NO compared to the nose that this probably doesn’t materially lower sinus concentrations.
the power of humming goes down with each breath but is fully restored in 3 minutes, suggesting that whatever change happens in the sinuses is restored quickly
From my limited understanding of virology and immunology, alternating intensity of NO between sinuses and nose every three minutes is probably better than keeping sinus concentrations high[1]. The first second of NO does the most damage to microbes[2], so alternation isn’t that bad.
I’d love to test this. The device you linked works via the mouth, and we’d need something that works via the nose. From a quick google it does look like it’s the same test, so we’d just need a nasal adaptor.
Other options:
Nnoxx. Consumer skin device, meant for muscle measurements
There are lots of devices for measuring concentration in the air, maybe they could be repurposed. Just breathing on it might be enough for useful relative metrics, even if they’re low-precision.
I’m also going to try to talk my asthma specialist into letting me use their oral machine to test my nose under multiple circumstances, but it seems unlikely she’ll go for it.
obvious question: so why didn’t evolution do that? Ancestral environment didn’t have nearly this disease (or pollution) load. This doesn’t mean I’m right but it means I’m discounting that specific evolutionary argument.
Mathematical descriptions are powerful because they can be very terse. You can specify only the properties of a system and still get a well-defined system.
This is in contrast to writing algorithms and data structures where you need to get concrete implementations of the algorithms and data structures to get a full description.
“Mathematical descriptions” is a little ambiguous. Equations and models are terse. The mapping of such equations to human-level system expectations (anticipated conditional experiences) can require quite a bit of verbosity.
I think that’s what you’re saying with the “algorithms and data structures” part, but I’m unsure if you’re claiming that the property specification of the math is sufficient as a description, and comparable in fidelity to the algorithmic implementation.
The Model-View-Controller architecture is very powerful. It allows us to separate concerns.
For example, if we want to implement an algorithm, we can write down only the data structures and algorithms that are used.
We might want to visualize the steps that the algorithm is performing, but this can be separated from the actual running of the algorithm.
If the algorithm is interactive, then instead of putting the interaction logic in the algorithm, which could be thought of as the rules of the world, we instead implement functionality that directly changes the underlying data that the original algorithm is working on. These could be parameters to the original algorithm, which would modify the runtime behavior (e.g. we could change the maximum search depth for BFS). It could also change the current data the algorithm is working on (e.g. in quicksort we could change the pivot, or smaller_than_list just before they are set). The distinction is somewhat arbitrary. If we were to step through some Python code with a debugger, we could just set any variables in the program.
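Here is a minimal sketch of that separation for the BFS example (the names and structure are mine, purely illustrative):

```python
# Model: the algorithm and its data, with tweakable parameters.
from collections import deque

def bfs_steps(graph, start, params):
    """Yield the frontier at each depth; max_depth is re-read from params every
    iteration, so a controller can change it while the algorithm runs."""
    frontier, seen, depth = deque([start]), {start}, 0
    while frontier and depth < params["max_depth"]:
        yield depth, list(frontier)
        next_frontier = deque()
        for node in frontier:
            for nb in graph.get(node, []):
                if nb not in seen:
                    seen.add(nb)
                    next_frontier.append(nb)
        frontier, depth = next_frontier, depth + 1

# View: how a step is displayed, kept out of the algorithm itself.
def show(depth, frontier):
    print(f"depth {depth}: {frontier}")

# Controller: pokes the underlying parameters rather than the algorithm.
graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
params = {"max_depth": 1}
for depth, frontier in bfs_steps(graph, "a", params):
    show(depth, frontier)
    if depth == 0:
        params["max_depth"] = 3   # interaction: change the parameter mid-run
```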
Usually, people think of something much “narrower” when they think about the Model-View-Controller-Architecture.
We could also do the same for a mathematical description. We can write down some mathematically well-defined thing and then separately think about how we can visualize this thing. And then again, separately, we can think about how would we interact with this thing.
A very rough draft of a plan to test prophylactics for airborne illnesses.
Start with a potential superspreader event. My ideal is a large conference, many of whom travelled to get there, in enclosed spaces with poor ventilation and air purification, in winter. Ideally >=4 days, so that people infected on day one are infectious while the conference is still running.
Call for sign-ups for testing ahead of time (disclosing all possible substances and side effects). Split volunteers into control and test group. I think you need ~500 sign ups in the winter to make this work.
Splitting controls is probably the hardest part. You’d like the control and treatment group to be identical, but there are a lot of things that affect susceptibility. Age, local vs. air travel, small children vs. not, sleep habits… it’s hard to draw the line
Make it logistically trivial to use the treatment. If it’s lozenges or liquids, put individually packed dosages in every bathroom, with a sign reminding people to use them (color code to direct people to the right basket). If it’s a nasal spray you will need to give everyone their own bottle, but make it trivial to get more if someone loses theirs.
Follow-up a week later, asking if people have gotten sick and when.
If the natural disease load is high enough this should give better data than any paper I’ve found.
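A quick power-simulation sketch for the 250-vs-250 version, to see how sensitive the design is to the (unknown) attack rate and effect size; all the numbers in it are assumptions, not estimates:

```python
# Rough power check for a 250-vs-250 one-off trial, over a grid of assumed
# attack rates (probability a control-group attendee gets sick and reports it)
# and effect sizes (relative risk under the prophylactic). Purely illustrative.
import numpy as np

rng = np.random.default_rng(0)
N_SIMS, N_PER_ARM = 2000, 250

def power(attack_rate, relative_risk):
    rejections = 0
    for _ in range(N_SIMS):
        control = rng.binomial(N_PER_ARM, attack_rate)
        treated = rng.binomial(N_PER_ARM, attack_rate * relative_risk)
        pooled = (control + treated) / (2 * N_PER_ARM)
        se = np.sqrt(2 * pooled * (1 - pooled) / N_PER_ARM)
        if se > 0 and (control - treated) / N_PER_ARM / se > 1.645:  # one-sided 5%
            rejections += 1
    return rejections / N_SIMS

for attack_rate in (0.05, 0.10, 0.20):
    for relative_risk in (0.5, 0.7):
        print(f"attack rate {attack_rate:.0%}, risk ratio {relative_risk}: "
              f"power ~ {power(attack_rate, relative_risk):.0%}")
```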
This sounds like a bad plan because it will be a logistics nightmare (undermining randomization) with high attrition, and extremely high variance due to between-subject design (where subjects differ a ton at baseline, in addition to exposure) on a single occasion with uncontrolled exposures and huge measurement error where only the most extreme infections get reported (sometimes). You’ll probably get non-answers, if you finish at all. The most likely outcome is something goes wrong and the entire effort is wasted.
Since this is a topic which is highly repeatable within-person (and indeed, usually repeats often through a lifetime...), this would make more sense as within-individual and using higher-quality measurements.
One good QS approach would be to exploit the fact that infections, even asymptomatic ones, seem to affect heart rate etc as the body is damaged and begins fighting the infection. HR/HRV is now measurable off the shelf with things like the Apple Watch, AFAIK. So you could recruit a few tech-savvy conference-goers for measurements from a device they already own & wear. This avoids any ‘big bang’ and lets you prototype and tweak on a few people—possibly yourself? - before rolling it out, considerably de-risking it.
There are some people who travel constantly for business and going to conferences, and recruiting and managing a few of them would probably be infinitely easier than 500+ randos (if for no reason other than being frequent flyers they may be quite eager for some prophylactics), and you would probably get far more precise data out of them if they agree to cooperate for a year or so and you get eg 10 conferences/trips out of each of them which you can contrast with their year-round baseline & exposome and measure asymptomatic infections or just overall health/stress. (Remember, variance reduction yields exponential gains in precision or sample-size reduction. It wouldn’t be too hard for 5 or 10 people to beat a single 250vs250 one-off experiment, even if nothing whatsoever goes wrong in the latter. This is a case where a few hours writing simulations to do power analysis on could be very helpful. I bet that the ability to detect asymptomatic cases, and run within-person, will boost statistical power a lot more than you think compared to ad hoc questionnaires emailed afterwards which may go straight to spam...)
I wonder if you could also measure the viral load as a whole to proxy for the viral exposome through something like a tiny air filter, which can be mailed in for analysis, like the exposometer? Swap out the exposometer each trip and you can measure load as a covariate.
I don’t really know what people mean when they try to compare “capabilities advancements” to “safety advancements”. In one sense, it’s pretty clear. The common units are “amount of time”, so we should compare the marginal (probabilistic) difference between time-to-alignment and time-to-doom. But I think practically people just look at vibes.
For example, if someone releases a new open source model people say that’s a capabilities advance, and should not have been done. Yet I think there’s a pretty good case that more well-trained open source models are better for time-to-alignment than for time-to-doom, since much alignment work ends up being done with them, and the marginal capabilities advance here is zero. Such work builds on the public state of the art, but not the private state of the art, which is probably far more advanced.
I also don’t often see people making estimates of the time-wise differential impacts here. Maybe people think such things would be exfo/info-hazardous, but nobody even claims to have estimates here when the topic comes up (even in private, though people are glad to talk about their hunches for what AI will look like in 5 years, or the types of advancements necessary for AGI), despite all the work on timelines. It’s difficult to do this for the marginal advance, but not so much for larger research priorities, which are the sorts of things people should be focusing on anyway.
People who have the ability to clarify in any meaningful way will not do so. You are in a biased environment: the people most willing to publish are those most able to convince themselves their research is safe (eg, because they don’t understand in detail how to reason about whether it is or not). The ability to see far enough ahead would of course be expected to be rather rare, and most people who think they can tell the exact path ahead of time don’t have the evidence to back their hunches, even if their hunches are correct; unless they have a demonstrated track record, they probably aren’t. Therefore, whoever is making the most progress on real capabilities insights under the name of alignment will make their advancements and publish them, since they don’t personally see how it’s exfohaz. And it won’t be apparent until afterwards that it was capabilities, not alignment.
So just don’t publish anything, and do your work in private. Email it to anthropic when you know how to create a yellow node. But for god’s sake stop accidentally helping people create green nodes because you can’t see five inches ahead. And don’t send it to a capabilities team before it’s able to guarantee moral alignment hard enough to make a red-proof yellow node!
This seems contrary to how much of science works. I expect if people stopped talking publicly about what they’re working on in alignment, we’d make much less progress, and capabilities would basically run business as usual.
The sort of reasoning you use here, and the fact that my only response to it basically amounts to “well, no, I think you’re wrong; this proposal will slow down alignment too much”, is why I think we need numbers to ground us.
Yeah, I agree that releasing open-weights non-frontier models doesn’t seem like a frontier capabilities advance.
It does seem potentially like an open-source capabilities advance.
That can be bad in different ways.
Let me pose a couple hypotheticals.
What if frontier models were already capable of causing grave harms to the world if used by bad actors, and it is only the fact that they are kept safety-fine-tuned and restricted behind APIs that is preventing this? In such a case, it’s a dangerous thing to have open-weight models catching up.
What if there is some threshold beyond which a model would be capable of recursive self-improvement, given sufficient scaffolding and unwise pressure from an incautious user? Again, the frontier labs might well abstain from this course. Especially if they weren’t sure they could trust the new model design created by the current AI. They would likely move slowly and cautiously at least. I would not expect this of the open-source community. They seem focused on pushing the boundaries of agent-scaffolding and incautiously exploring whatever they can.
So, as we get closer to danger, open-weight models take on more safety significance.
Yeah, there are reasons for caution. I think it makes sense for those concerned or non-concerned to make numerical forecasts about the costs & benefits of such questions, rather than the current state of everyone just comparing their vibes against each other. This generalizes to other questions, like the benefits of interpretability, advances in safety fine-tuning, deep learning science, and agent foundations.
Obviously such numbers aren’t the end-of-the-line, and like in biorisk, sometimes they themselves should be kept secret. But it still seems like a great advance.
If anyone would like to collaborate on such a project, my DMs are open (not to say this topic is covered; this isn’t exactly my main wheelhouse).
A semi-formalization of shard theory. I think that there is a surprisingly deep link between “the AIs which can be manipulated using steering vectors” and “policies which are made of shards.”[1] In particular, here is a candidate definition of a shard theoretic policy:
A policy has shards if it implements at least two “motivational circuits” (shards) which can independently activate (more precisely, the shard activation contexts are compositionally represented).
By this definition, humans have shards because they can want food at the same time as wanting to see their parents again, and both factors can affect their planning at the same time! The maze-solving policy is made of shards because we found activation directions for two motivational circuits (the cheese direction, and the top-right direction).
On the other hand, AIXI is not a shard theoretic agent because it does not have two motivational circuits which can be activated independently of each other. It’s just maximizing one utility function. A mesa optimizer with a single goal also does not have two motivational circuits which can go on and off in an independent fashion.
This definition also makes obvious the fact that “shards” are a matter of implementation, not of behavior.
It also captures the fact that “shard” definitions are somewhat subjective. In one moment, I might model someone as having a separate “ice cream shard” and “cookie shard”, but in another situation I might choose to model those two circuits as a larger “sweet food shard.”
So I think this captures something important. However, it leaves a few things to be desired:
What, exactly, is a “motivational circuit”? Obvious definitions seem to include every neural network with nonconstant outputs.
Shard theory reasoning led me to discover the steering vector technique extremely quickly. This link would explain why shard theory might help discover such a technique.
I’m not so sure that shards should be thought of as a matter of implementation. Contextually activated circuits are a different kind of thing from utility function components. The former activate in certain states and bias you towards certain actions, whereas utility function components score outcomes. I think there are at least 3 important parts of this:
A shardful agent can be incoherent due to valuing different things from different states
A shardful agent can be incoherent due to its shards being shallow, caring about actions or proximal effects rather than their ultimate consequences
A shardful agent saves compute by not evaluating the whole utility function
The first two are behavioral. We can say an agent is likely to be shardful if it displays these types of incoherence but not others. Suppose an agent is dynamically inconsistent and we can identify features in the environment like cheese presence that cause its preferences to change, but it mostly does not suffer from the Allais paradox, tends to spend resources on actions proportional to their importance for reaching a goal, and otherwise generally behaves rationally. Then we can hypothesize that the agent has some internal motivational structure which can be decomposed into shards. But exactly what that motivational structure is remains very uncertain for humans and future agents. My guess is researchers need to observe models and form good definitions as they go along, and defining a shard agent as having compositionally represented motivators is premature. For now the most important thing is how steerable agents will be, and it is very plausible that we can manipulate motivational features without the features being anything like compositional.
For illustration, what would be an example of having different shards for “I get food” (F) and “I see my parents again” (P) compared to having one utility distribution over F∧P, F∧¬P, ¬F∧P, ¬F∧¬P?
I think this is also what I was confused about—TurnTrout says that AIXI is not a shard-theoretic agent because it just has one utility function, but typically we imagine that the utility function itself decomposes into parts e.g. +10 utility for ice cream, +5 for cookies, etc. So the difference must not be about the decomposition into parts, but the possibility of independent activation? but what does that mean? Perhaps it means: The shards aren’t always applied, but rather only in some circumstances does the circuitry fire at all, and there are circumstances in which shard A fires without B and vice versa. (Whereas the utility function always adds up cookies and ice cream, even if there are no cookies and ice cream around?) I still feel like I don’t understand this.
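One toy way to cash out “independent activation” as contextual gating (my own illustration, not something from the shard theory posts): a utility agent always scores outcomes with every term, while a shard agent only gets a push from a motivational circuit when that circuit's activation context is present.

```python
# Toy contrast: utility-function components score outcomes unconditionally,
# while "shards" are contextually gated circuits that bias actions.

def utility_agent_score(outcome):
    # Always sums all terms, whether or not ice cream / cookies are around.
    return 10 * outcome.get("ice_cream", 0) + 5 * outcome.get("cookies", 0)

SHARDS = [
    # (name, activation-context predicate, action-bias contributed when active)
    ("ice_cream_shard", lambda s: s["sees_ice_cream"], {"approach_freezer": 3.0}),
    ("cookie_shard",    lambda s: s["smells_cookies"], {"go_to_kitchen": 2.0}),
]

def shard_agent_action_bias(state):
    # Only contextually active shards contribute; each can fire without the other.
    bias = {}
    for name, is_active, contribution in SHARDS:
        if is_active(state):
            for action, weight in contribution.items():
                bias[action] = bias.get(action, 0.0) + weight
    return bias

print(utility_agent_score({"ice_cream": 1, "cookies": 1}))   # 15, regardless of context
print(shard_agent_action_bias({"sees_ice_cream": True,  "smells_cookies": False}))
print(shard_agent_action_bias({"sees_ice_cream": False, "smells_cookies": True}))
print(shard_agent_action_bias({"sees_ice_cream": True,  "smells_cookies": True}))
```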
@jessicata once wrote “Everyone wants to be a physicalist but no one wants to define physics”. I decided to check the SEP article on physicalism and found that, yep, it doesn’t have a definition of physics:
Carl Hempel (cf. Hempel 1969, see also Crane and Mellor 1990) provided a classic formulation of this problem: if physicalism is defined via reference to contemporary physics, then it is false — after all, who thinks that contemporary physics is complete? — but if physicalism is defined via reference to a future or ideal physics, then it is trivial — after all, who can predict what a future physics contains? Perhaps, for example, it contains even mental items. The conclusion of the dilemma is that one has no clear concept of a physical property, or at least no concept that is clear enough to do the job that philosophers of mind want the physical to play.
<...>
Perhaps one might appeal here to the fact that we have a number of paradigms of what a physical theory is: common sense physical theory, medieval impetus physics, Cartesian contact mechanics, Newtonian physics, and modern quantum physics. While it seems unlikely that there is any one factor that unifies this class of theories, perhaps there is a cluster of factors — a common or overlapping set of theoretical constructs, for example, or a shared methodology. If so, one might maintain that the notion of a physical theory is a Wittgensteinian family resemblance concept.
This surprised me because I have a definition of a physical theory and assumed that everyone else uses the same.
Perhaps my personal definition of physics is inspired by Engels’s “Dialectics of Nature”: “Motion is the mode of existence of matter.” Assuming “matter is described by physics,” we get “physics is the science that reduces studied phenomena to motion.” Or, to express it in a more analytical manner, “a physicalist theory is a theory that assumes that everything can be explained by reduction to characteristics of space and its evolution in time.”
For example, “vacuum” is a part of space with a “zero” value in all characteristics. A “particle” is a localized part of space with some non-zero characteristic. A “wave” is a part of space with periodic changes of some characteristic in time and/or space. We can abstract away “part of space” from “particle” and start to talk about a particle as a separate entity; the speed of a particle is actually the derivative of a spatial characteristic with respect to time, force is defined as the cause of acceleration, mass is a measure of resistance to acceleration given the same force, such-and-such charge is the cause of such-and-such force, and it all unfolds from the structure of various purely spatial characteristics in time.
The tricky part is, “Sure, we live in space and time, so everything that happens is some motion. How to separate physicalist theory from everything else?”
Let’s imagine that we have some kind of “vitalist field.” This field interacts with C, H, O, N atoms and also with molybdenum; it accelerates certain chemical reactions, and if you prepare an Oparin-Haldane soup and radiate it with vitalist particles, you will soon observe autocatalytic cycles resembling hypothetical primordial life. All living organisms utilize vitalist particles in their metabolic pathways, and if you somehow isolate them from an outside source of particles, they’ll die.
Despite having a “vitalist field,” such a world would be pretty much physicalist.
An unphysical vitalist world would look like this: if you have glowing rocks and a pile of organic matter, the organic matter is going to transform into mice. Or frogs. Or mosquitoes. Even if the glowing rocks have a constant glow and the composition of the organic matter is the same and the environment in a radius of a hundred miles is the same, nobody can predict from any observables which kind of complex life is going to emerge. It looks like the glowing rocks have their own will, unquantifiable by any kind of measurement.
The difference is that the “vitalist field” in the second case has its own dynamics not reducible to any spatial characteristics of the “vitalist field”; it has an “inner life.”
I think some of the AI safety policy community has over-indexed on the visual model of the “Overton Window” and under-indexed on alternatives like the “ratchet effect,” “poisoning the well,” “clown attacks,” and other models where proposing radical changes can make you, your allies, and your ideas look unreasonable (edit to add: whereas successfully proposing minor changes achieves hard-to-reverse progress, making ideal policy look more reasonable).
I’m not familiar with a lot of systematic empirical evidence on either side, but it seems to me like the more effective actors in the DC establishment overall are much more in the habit of looking for small wins that are both good in themselves and shrink the size of the ask for their ideal policy, than of pushing for their ideal vision and then making concessions. Possibly an ideal ecosystem has both strategies, but it seems possible that at least some versions of “Overton Window-moving” strategies, as executed in practice, do more harm than good: the negative effect of associating their “side” with unreasonable-sounding ideas in the minds of very bandwidth-constrained policymakers (who strongly lean on signals of credibility and consensus when quickly evaluating policy options) can outweigh the positive effects of increasing the odds of ideal policy and improving the framing for non-ideal but pretty good policies.
In theory, the Overton Window model is just a description of what ideas are taken seriously, so it can indeed accommodate backfire effects where you argue for an idea “outside the window” and this actually makes the window narrower. But I think the visual imagery of “windows” actually struggles to accommodate this (when was the last time you tried to open a window and accidentally closed it instead?), and as a result, people who rely on this model are more likely to underrate these kinds of consequences.
Would be interested in empirical evidence on this question (ideally actual studies from psych, political science, sociology, econ, etc literatures, rather than specific case studies due to reference class tennis type issues).
I’m not a decel, but the way this stuff often is resolved is that there are crazy people that aren’t taken seriously by the managerial class but that are very loud and make obnoxious asks. Think the evangelicals against abortion or the Columbia protestors.
Then there is some elite, part of the managerial class, that makes reasonable policy claims. For abortion, this is Mitch McConnell, being disciplined over a long period of time in choosing the correct judges. For Palestine, this is Blinken and his State Department bureaucracy.
The problem with decels is that theoretically they are part of the managerial class themselves. Or at least, they act like they are. They call themselves rationalists, read Eliezer and Scott Alexander, and whatnot. But the problem is that it’s very hard for an uninterested third party to take seriously these bogus Overton Window claims from people who were supposed to be measured members of the managerial class.
You need to split. There are the crazy ones that people don’t take seriously, but who will move the managerial class. And there are the serious people that EA money will send to D.C. to work at Blumenthal’s office. This person needs to make small policy requests that will sabotage AI, without looking like it. And slowly, you get policy wins and you can sabotage the whole effort.
Agree with lots of this– a few misc thoughts [hastily written]:
I think the Overton Window frame ends up getting people to focus too much on the dimension “how radical is my ask”– in practice, things are usually much more complicated than this. In my opinion, a preferable frame is something like “who is my target audience and what might they find helpful.” If you’re talking to someone who makes it clear that they will not support X, it’s silly to keep on talking about X. But I think the “target audience first” approach ends up helping people reason in a more sophisticated way about what kinds of ideas are worth bringing up. As an example, in my experience so far, many policymakers are curious to learn more about intelligence explosion scenarios and misalignment scenarios (the more “radical” and “speculative” threat models).
I don’t think it’s clear that the more effective actors in DC tend to be those who look for small wins. Outside of the AIS community, there sure do seem to be a lot of successful organizations that take hard-line positions and (presumably) get a lot of their power/influence from the ideological purity that they possess & communicate. Whether or not these organizations end up having more or less influence than the more “centrist” groups is, in my view, not a settled question & probably varies a lot by domain. In AI safety in particular, I think my main claim is something like “pretty much no group – whether radical or centrist – has had tangible wins.” When I look at the small set of tangible wins, it seems like the groups involved were across the spectrum of “reasonableness.”
The more I interact with policymakers, the more I’m updating toward something like “poisoning the well doesn’t come from having radical beliefs– poisoning the well comes from lamer things like being dumb or uninformed, wasting peoples’ time, not understanding how the political process works, not having tangible things you want someone to do, explaining ideas poorly, being rude or disrespectful, etc.” I’ve asked ~20-40 policymakers (outside of the AIS bubble) things like “what sorts of things annoy you about meetings” or “what tends to make meetings feel like a waste of your time”, and no one ever says “people come in with ideas that are too radical.” The closest thing I’ve heard is people saying that they dislike it when groups fail to understand why things aren’t able to happen (like, someone comes in thinking their idea is great, but then they fail to understand that their idea needs approval from committee A and appropriations person B and then they’re upset about why things are moving slowly). It seems to me like many policy folks (especially staffers and exec branch subject experts) are genuinely interested in learning more about the beliefs and worldviews that have been prematurely labeled as “radical” or “unreasonable” (or perhaps such labels were appropriate before chatGPT but no longer are).
A reminder that those who are opposed to regulation have strong incentives to make it seem like basically-any-regulation is radical/unreasonable. An extremely common tactic is for industry and its allies to make common-sense regulation seem radical/crazy/authoritarian & argue that actually the people proposing strong policies are just making everyone look bad & argue that actually we should all rally behind [insert thing that isn’t a real policy.] (I admit this argument is a bit general, and indeed I’ve made it before, so I won’t harp on it here. Also I don’t think this is what Trevor is doing– it is indeed possible to raise serious discussions about “poisoning the well” even if one believes that the cultural and economic incentives disproportionately elevate such points).
In the context of AI safety, it seems to me like the most high-influence Overton Window moves have been positive– and in fact I would go as far as to say strongly positive. Examples that come to mind include the CAIS statement, FLI pause letter, Hinton leaving Google, Bengio’s writings/speeches about rogue AI & loss of control, Ian Hogarth’s piece about the race to god-like AI, and even Yudkowsky’s TIME article.
I think some of our judgments here depend on underlying threat models and an underlying sense of optimism vs. pessimism. If one thinks that labs making voluntary agreements/promises and NIST contributing to the development of voluntary standards are quite excellent ways to reduce AI risk, then the groups that have helped make this happen deserve a lot of credit. If one thinks that much more is needed to meaningfully reduce xrisk, then the groups that are raising awareness about the nature of the problem, making high-quality arguments about threat models, and advocating for stronger policies deserve a lot of credit.
I agree that more research on this could be useful. But I think it would be most valuable to focus less on “is X in the Overton Window” and more on “is X written/explained well and does it seem to have clear implications for the target stakeholders?”
Re: over-emphasis on “how radical is my ask” vs. “what might my target audience find helpful,” and generally the importance of making your case well regardless of how radical it is: that makes sense. Though notably, the more radical your proposal is (or the more unfamiliar your threat models are), the higher the bar for explaining it well, so these do seem related.
Re: more effective actors looking for small wins, I agree that it’s not clear, but yeah, seems like we are likely to get into some reference class tennis here. “A lot of successful organizations that take hard-line positions and (presumably) get a lot of their power/influence from the ideological purity that they possess & communicate”? Maybe, but I think of like, the agriculture lobby, who just sort of quietly make friends with everybody and keep getting 11-figure subsidies every year, in a way that (I think) resulted more from gradual ratcheting than making a huge ask. “Pretty much no group– whether radical or centrist– has had tangible wins” seems wrong in light of the EU AI Act (where I think both a “radical” FLI and a bunch of non-radical orgs were probably important) and the US executive order (I’m not sure which strategy is best credited there, but I think most people would have counted the policies contained within it as “minor asks” relative to licensing, pausing, etc). But yeah I agree that there are groups along the whole spectrum that probably deserve credit.
Re: poisoning the well, again, radical-ness and being dumb/uninformed are of course separable but the bar rises the more radical you get, in part because more radical policy asks strongly correlate with more complicated procedural asks; tweaking ECRA is both non-radical and procedurally simple, creating a new agency to license training runs is both outside the DC Overton Window and very procedurally complicated.
Re: incentives, I agree that this is a good thing to track, but like, “people who oppose X are incentivized to downplay the reasons to do X” is just a fully general counterargument. Unless you’re talking about financial conflicts of interest, but there are also financial incentives for orgs pursuing a “radical” strategy to downplay boring real-world constraints, as well as social incentives (e.g. on LessWrong IMO) to downplay these boring constraints, and cognitive biases against thinking your preferred strategy has big downsides.
I agree that the CAIS statement, Hinton leaving Google, and Bengio and Hogarth’s writing have been great. I think that these are all in a highly distinct category from proposing specific actors take specific radical actions (unless I’m misremembering the Hogarth piece). Yudkowsky’s TIME article, on the other hand, definitely counts as an Overton Window move, and I’m surprised that you think this has had net positive effects. I regularly hear “bombing datacenters” as an example of a clearly extreme policy idea, sometimes in a context that sounds like it maybe made the less-radical idea seem more reasonable, but sometimes as evidence that the “doomers” want to do crazy things and we shouldn’t listen to them, and often as evidence that they are at least socially clumsy, don’t understand how politics works, etc, which is related to the things you list as the stuff that actually poisons the well. (I’m confused about the sign of the FLI letter as we’ve discussed.)
I’m not sure optimism vs pessimism is a crux, except in very short, like, 3-year timelines. It’s true that optimists are more likely to value small wins, so I guess narrowly I agree that a ratchet strategy looks strictly better for optimists, but if you think big radical changes are needed, the question remains of whether you’re more likely to get there via asking for the radical change now or looking for smaller wins to build on over time. If there simply isn’t time to build on these wins, then yes, better to take a 2% shot at the policy that you actually think will work; but even in 5-year timelines I think you’re better positioned to get what you ultimately want by 2029 if you get a little bit of what you want in 2024 and 2026 (ideally while other groups also make clear cases for the threat models and develop the policy asks, etc.). Another piece this overlooks is the information and infrastructure built by the minor policy changes. A big part of the argument for the reporting requirements in the EO was that there will now be an office in the US government that is in the business of collecting critical information about frontier AI models and figuring out how to synthesize it for the rest of government; that office has the legal authority to do this, and both the office and the legal authority can now be expanded rather than created from scratch; there will now be lots of individuals in government who are experienced in dealing with this information; and it will come to seem natural that the government should know this information. I think if we had only been developing and advocating for ideal policy, this would not have happened (though I imagine that this is not in fact what you’re suggesting the community do!).
Unless you’re talking about financial conflicts of interest, but there are also financial incentives for orgs pursuing a “radical” strategy to downplay boring real-world constraints, as well as social incentives (e.g. on LessWrong IMO) to downplay these boring constraints, and cognitive biases against thinking your preferred strategy has big downsides.
It’s not just that problem, though; they will likely also be biased to think that their policy helps AI safety at all, and this is a point that sometimes gets forgotten.
But you’re correct that Akash’s argument is fully general.
Ingroup losing status? Few things are more prone to distorted perception than that.
And I think this makes sense (e.g. Simler’s Social Status: Down the Rabbit Hole which you’ve probably read), if you define “AI Safety” as “people who think that superintelligence is serious business or will be some day”.
The psych dynamic that I find helpful to point out here is Yud’s Is That Your True Rejection post from ~16 years ago. A person who hears about superintelligence for the first time will often respond to their double-take at the concept by spamming random justifications for why that’s not a problem (which, notably, feels like legitimate reasoning to that person, even though it’s not). An AI-safety-minded person becomes wary of being effectively attacked by high-status people immediately turning into what is basically a weaponized justification machine, and develops a deep drive wanting that not to happen. Then justifications ensue for wanting that to happen less frequently in the world, because deep down humans really don’t want their social status to be put at risk (via denunciation) on a regular basis like that. These sorts of deep drives are pretty opaque to us humans but their real world consequences are very strong.
Something that seems more helpful than playing whack-a-mole whenever this issue comes up is having more people in AI policy putting more time into improving perspective. I don’t see shorter paths to increasing the number of people-prepared-to-handle-unexpected-complexity than giving people a broader and more general thinking capacity for thoughtfully reacting to the sorts of complex curveballs that you get in the real world. Rationalist fiction like HPMOR is great for this, as well as others e.g. Three Worlds Collide, Unsong, Worth the Candle, Worm (list of top rated ones here). With the caveat, of course, that doing well in the real world is less like the bite-sized easy-to-understand events in ratfic, and more like spotting errors in the methodology section of a study or making money playing poker.
I think, given the circumstances, it’s plausibly very valuable e.g. for people already spending much of their free time on social media or watching stuff like The Office, Garfield reruns, WWI and Cold War documentaries, etc, to only spend ~90% as much time doing that and refocusing ~10% to ratfic instead, and maybe see if they can find it in themselves to want to shift more of their leisure time to that sort of passive/ambient/automatic self-improvement productivity.
These are plausible concerns, but I don’t think they match what I see as a longtime DC person.
We know that the legislative branch is less productive in the US than it has been in any modern period, and fewer bills get passed (there are many different metrics for this, but one is https://www.reuters.com/graphics/USA-CONGRESS/PRODUCTIVITY/egpbabmkwvq/). Those bills that do get passed tend to be bigger swings as a result: either a) transformative legislation (e.g., Obamacare, Trump tax cuts and COVID super-relief, Biden Inflation Reduction Act and CHIPS) or b) big omnibus “must-pass” bills like FAA reauthorization, into which many small proposals get added.
I also disagree with the claim that policymakers focus on credibility and consensus generally, except perhaps in the executive branch to some degree. (You want many executive actions to be noncontroversial “faithfully executing the laws” stuff, but I don’t see that as “policymaking” in the sense you describe it.)
In either of those, it seems like the current legislative “meta” favors bigger policy asks, not small wins, and I’m having trouble thinking of anyone I know who’s impactful in DC who has adopted the opposite strategy. What are examples of the small wins that you’re thinking of as being the current meta?
That likely includes, directly or indirectly, the Chinese government.
What does the US Congress do to protect against spying by China? Of course, it bans TikTok instead of actually protecting the data of US citizens.
If you have threat models in which the Chinese government might target you, assume that they know where your phone is, and shut it off when going somewhere you don’t want the Chinese government (or, for that matter, anyone with a decent amount of capital) to know about.
I don’t have confidence in my models of how coherent and competent governments are at getting and using data like this. The primary buyers of location data are advertisers and business planners looking for statistical correlations for targeting and decisions. This is creepy, but not directly comparable to “targeted by the Chinese government”.
My competing theories of “targeted by the Chinese government” threats are:
they’re hyper-competent and have employees/agents at most carriers who will exfiltrate the needed data, so stopping the explicit sale just means it’s less visible.
they’re as bureaucratic and confused as everything else, so even if they know where you are, they’re unable to really do much with it.
I think the tension is what does it even mean to be targeted by a government.
I don’t have confidence in my models of how coherent and competent governments are at getting and using data like this.
The Office of the Director of National Intelligence wrote a report about this question that was declassified last year. They use the abbreviation CAI for “commercially available information.”
“2.5. (U) Counter-Intelligence Risks in CAI. There is also a growing recognition that CAI, as a generally available resource, offers intelligence benefits to our adversaries, some of which may create counter-intelligence risk for the IC. For example, the January 2021 CSIS report cited above also urges the IC to ‘test and demonstrate the utility of OSINT and AI in analysis on critical threats, such as the adversary use of AI-enabled capabilities in disinformation and influence operations.’”
Last month there was a political fight about warrant requirements when US intelligence agencies use commercially bought data, which was likely partly caused by the concerns from that report.
I think the tension is what does it even mean to be targeted by a government.
Here, I mean that you are doing something that’s of interest to Chinese intelligence services. People who want to lobby for Chinese AI policy probably fall under that class.
I’m not sure to what extent people working at top AI labs might be blackmailed by the Chinese government to do things like give them their source code.
[note: I suspect we mostly agree on the impropriety of open selling and dissemination of this data. This is a narrow objection to the IMO hyperbolic focus on government assault risks. ]
I’m unhappy with the phrasing of “targeted by the Chinese government”, which IMO implies violence or other real-world interventions when the major threats are “adversary use of AI-enabled capabilities in disinformation and influence operations.” Thanks for mentioning blackmail—that IS a risk I put in the first category, and presumably becomes more possible with phone location data. I don’t know how much it matters, but there is probably a margin where it does.
I don’t disagree that this purchasable data makes advertising much more effective (in fact, I worked at a company based on this for some time). I only mean to say that “targeting” in the sense of disinformation campaigns is a very different level of threat from “targeting” of individuals for government ops.
This is a narrow objection to the IMO hyperbolic focus on government assault risks.
Whether or not you face government assault risks depends on what you do. Most people don’t face government assault risks. Some people engage in work or activism that results in them having government assault risks.
The Chinese government has strategic goals and most people are unimportant to those. Some people however work on topics like AI policy in which the Chinese government has an interest.
I feel like comparing this enforcement to the TikTok ban doesn’t address the actual primary concern about TikTok, which is content curation by its opaque algorithm, not data privacy per se.
By analogy, if a Soviet state-owned enterprise in 1980 wanted to purchase NBC, would/should we have allowed that? If your answer is “no,” keeping in mind how many people get their news via TikTok, why would/should we allow what effectively seems to be a CCP-(owned or heavily influenced) company to control what content our people see?
Politico wrote, “Perhaps the most pressing concern is around the Chinese government’s potential access to troves of data from TikTok’s millions of users.” The concern that TikTok supposedly is spyware is frequently made in discussions about why it should be banned.
If the main issue is content moderation decisions, the best way to deal with it would be to legislate transparency around content moderation decisions and require TikTok to outsource the moderation decisions to some US contractor.
I worked at OpenAI for three years, from 2021-2024, on the Alignment team, which eventually became the Superalignment team. I worked on scalable oversight, as part of the team developing critiques as a technique for using language models to spot mistakes in other language models. I then worked to refine an idea from Nick Cammarata into a method for using language models to generate explanations for features in language models. I was then promoted to managing a team of 4 people which worked on trying to understand language model features in context, leading to the release of an open source “transformer debugger” tool.
I resigned from OpenAI on February 15, 2024.
What are your timelines like? How long do YOU think we have left?
I know several CEOs of small AGI startups who seem to have gone crazy and told me that they are self inserts into this world, which is a simulation of their original self’s creation. However, none of them talk about each other, and presumably at most one of them can be meaningfully right?
One AGI CEO hasn’t gone THAT crazy (yet), but is quite sure that the November 2024 election will be meaningless because pivotal acts will have already occurred that make nation state elections visibly pointless.
Also I know many normies who can’t really think probabilistically and mostly aren’t worried at all about any of this… but one normy who can calculate is pretty sure that we have AT LEAST 12 years (possibly because his retirement plans won’t be finalized until then). He also thinks that even systems as “mere” as TikTok will be banned before the November 2024 election because “elites aren’t stupid”.
I think I’m more likely to be better calibrated than any of these opinions, because most of them don’t seem to focus very much on “hedging” or “thoughtful doubting”, whereas my event space assigns non-zero probability to ensembles that contain such features of possible futures (including these specific scenarios).
Wondering why this has so many disagreement votes. Perhaps people don’t like to see the serious topic of “how much time do we have left”, alongside evidence that there’s a population of AI entrepreneurs who are so far removed from consensus reality, that they now think they’re living in a simulation.
I assume timelines are fairly long or this isn’t safety related. I don’t see a point in keeping PPUs or even caring about NDA lawsuits which may or may not happen and would take years in a short timeline or doomed world.
I think having a probability distribution over timelines is the correct approach. Like, in the comment above:
Even in probabilistic terms, the evidence of OpenAI members respecting their NDAs makes it more likely that this was some sort of political infighting (EA related) than sub-year takeoff timelines. I would be open to a 1 year takeoff, I just don’t see it happening given the evidence. OpenAI wouldn’t need to talk about raising trillions of dollars, companies wouldn’t be trying to commoditize their products, and the employees who quit OpenAI would speak up.
Political infighting is in general just more likely than very short timelines, which would go counter to most prediction markets on the matter. Not to mention, given that it has already happened with the firing of Sam Altman, it’s far more likely to have happened again.
If there was a probability distribution of timelines, the current events indicate sub 3 year ones have negligible odds. If I am wrong about this, I implore the OpenAI employees to speak up. I don’t think normies misunderstand probability distributions, they just usually tend not to care about unlikely events.
No, OpenAI (assuming that it is a well-defined entity) also uses a probability distribution over timelines.
(In reality, every member of its leadership has its own probability distribution, and this translates to OpenAI having a policy and behavior formulated approximately as if there is some resulting single probability distribution).
The important thing is, they are uncertain about timelines themselves (in part because no one knows how perplexity translates to capabilities; in part because there might be differences in capabilities even at the same perplexity if the underlying architectures are different (e.g. in-context learning might depend on architecture even with fixed perplexity, and we do see a stream of potentially very interesting architectural innovations recently); in part because it’s not clear how big the potential of “harness”/“scaffolding” is; and so on).
This does not mean there is no political infighting. But it’s on the background of them being correctly uncertain about true timelines...
Compute-wise, inference demands are huge and growing with the popularity of the models (look how much Facebook did to make Llama 3 more inference-efficient).
So if they expect models to become useful enough for almost everyone to want to use them, they should worry about compute, assuming they do want to serve people like they say they do (I am not sure how this looks for very strong AI systems; they will probably be gradually expanding access, and the speed of expansion might depend).
Why can at most one of them be meaningfully right?
Would not a simulation typically be “a multi-player game”?
(But yes, if they assume that their “original self” was the sole creator (?), then they would all be some kind of “clones” of that particular “original self”. Which would surely increase the overall weirdness.)
Thank you for your work there. Curious what specifically prompted you to post this now, presumably you leaving OpenAI and wanting to communicate that somehow?
No comment.
Can you confirm or deny whether you signed any NDA related to you leaving OpenAI?
(I would guess a “no comment” or lack of response or something to that degree implies a “yes” with reasonably high probability. Also, when deciding how to respond here, you might be interested in this link about NDAs offered during severance agreements that cover the existence of the NDA itself being ruled unlawful by the National Labor Relations Board.)
(not a lawyer)
My layman’s understanding is that managerial employees are excluded from that ruling, unfortunately. Which I think applies to William_S if I read his comment correctly. (See Pg 11, in the “Excluded” section in the linked pdf in your link)
I think it is safe to infer from the conspicuous and repeated silence by ex-OA employees when asked whether they signed a NDA which also included a gag order about the NDA, that there is in fact an NDA with a gag order in it, presumably tied to the OA LLC PPUs (which are not real equity and so probably even less protected than usual).
What’s PPU?
Daniel K seems pretty open about his opinions and reasons for leaving. Did he not sign an NDA and thus give up whatever PPUs he had?
When I spoke to him a few weeks ago (a week after he left OAI), he had not signed an NDA at that point, so it seems likely that he hasn’t.
Does anyone know if it’s typically the case that people under gag orders about their NDAs can talk to other people who they know signed the same NDAs? That is, if a bunch of people quit a company and all have signed self-silencing NDAs, are they normally allowed to talk to each other about why they quit and commiserate about the costs of their silence?
You should update by ±1% on AI doom surprisingly frequently
This is just a fact about how stochastic processes work. If your p(doom) is a Brownian motion in 1% steps, starting at 50% and stopping once it reaches 0% or 100%, then there will be about 50^2 = 2500 steps of size 1% in expectation. This is a lot! If we get all the evidence for whether humanity survives or not uniformly over the next 10 years, then you should make a 1% update 4-5 times per week. In practice there won’t be as many due to heavy-tailedness in the distribution and the fact that you don’t start at 50%. But I do believe that evidence is coming in every week such that ideal market prices should move by 1% on maybe half of weeks, and it is not crazy for your probabilities to shift by 1% during many weeks if you think about it.
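For intuition, here is a minimal simulation sketch (my own illustration of the claim above, not something from the original post): a ±1% random walk on p(doom) starting at 50% and absorbed at 0% or 100% takes about 50×50 = 2500 steps on average, which works out to roughly 4-5 one-point updates per week if spread evenly over 10 years.

```python
# Minimal sketch: simulate a +/-1% random walk on p(doom), starting at 50%,
# absorbed at 0% or 100%, and check that the mean number of 1% steps is ~2500.
import random

def steps_until_absorption(start=50, lo=0, hi=100):
    p, steps = start, 0
    while lo < p < hi:
        p += random.choice((-1, 1))  # one 1% update, up or down
        steps += 1
    return steps

trials = 2000
mean_steps = sum(steps_until_absorption() for _ in range(trials)) / trials
print(mean_steps)        # should come out near 50 * 50 = 2500
print(2500 / (10 * 52))  # ~4.8 one-point updates per week over 10 years
```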
Interesting...
Wouldn’t I expect the evidence to come out in a few big chunks, e.g. OpenAI releasing a new product?
I seriously doubt on priors that Boeing corporate is murdering employees.
Does anyone have any takes on the two Boeing whistleblowers who died under somewhat suspicious circumstances? I haven’t followed this in detail, and my guess is it is basically just random chance, but it sure would be a huge deal if a publicly traded company now was performing assassinations of U.S. citizens.
Curious whether anyone has looked into this, or has thought much about baseline risk of assassinations or other forms of violence from economic actors.
@jefftk comments on the HN thread on this:
Another HN commenter says (in a different thread):
I’m probably missing something simple, but what is 356? I was expecting a probability or a percent, but that number is neither.
I think it’s that 356 or more people in the population would be needed to make there be a >5% chance of 2+ deaths in a 2-month span from that population.
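To make the arithmetic behind a number like 356 explicit, here is a hedged sketch; the per-person death rate below is my own assumption (it is not stated in the thread), chosen only to show the shape of the calculation.

```python
# Find the smallest population n such that P(2 or more deaths in a 2-month
# window) exceeds 5%, assuming each person independently dies in that window
# with probability p. The p below is an assumption, not a figure from the thread.
def prob_two_or_more(n: int, p: float) -> float:
    # P(X >= 2) for X ~ Binomial(n, p)
    return 1 - (1 - p) ** n - n * p * (1 - p) ** (n - 1)

p = 0.001  # assumed per-person death probability over 2 months (~0.6%/year)
n = 1
while prob_two_or_more(n, p) <= 0.05:
    n += 1
print(n)  # 356 under this assumed p
```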
Thoughtdump on why I’m interested in computational mechanics:
one concrete application to natural abstractions from here: tl;dr, belief structures generally seem to be fractal shaped. one major part of natural abstractions is trying to find the correspondence between structures in the environment and concepts used by the mind. so if we can do the inverse of what adam and paul did, i.e. ‘discover’ fractal structures from activations and figure out what stochastic process they might correspond to in the environment, that would be cool
… but i was initially interested in reading compmech stuff not with a particular alignment relevant thread in mind but rather because it seemed broadly similar in directions to natural abstractions.
re: how my focus would differ from my impression of current compmech work done in academia: academia seems faaaaaar less focused on actually trying out epsilon reconstruction on real-world noisy data. CSSR is an example of a reconstruction algorithm. apparently people have done compmech stuff on real-world data; don’t know how good it was, but far less effort has been invested there compared to theory work
would be interested in these reconstruction algorithms, eg what are the bottlenecks to scaling them up, etc.
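to gesture at what “reconstruction” means here, a toy sketch (my own illustration of the flavor of causal-state reconstruction, not CSSR itself, which uses statistical tests and adaptively grown histories):

```python
# Toy causal-state reconstruction sketch: group fixed-length histories whose
# empirical next-symbol distributions are close (in total variation distance).
from collections import Counter, defaultdict

def causal_state_sketch(stream, L=3, tol=0.05):
    counts = defaultdict(Counter)
    for i in range(L, len(stream)):
        history = tuple(stream[i - L:i])
        counts[history][stream[i]] += 1

    def dist(counter):
        total = sum(counter.values())
        return {s: c / total for s, c in counter.items()}

    states = []  # each state: (representative distribution, member histories)
    for history, counter in counts.items():
        d = dist(counter)
        for rep, members in states:
            tv = 0.5 * sum(abs(d.get(s, 0) - rep.get(s, 0)) for s in set(d) | set(rep))
            if tv < tol:
                members.append(history)
                break
        else:
            states.append((d, [history]))
    return states

# a period-2 process should collapse all length-3 histories into ~2 states
print(len(causal_state_sketch([0, 1] * 500)))
```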
tangent: epsilon transducers seem cool. if the reconstruction algorithm is good, a prototypical example i’m thinking of is something like: pick some input-output region within a model, and literally try to discover the hmm model reconstructing it? of course it’s gonna be unwieldy and large. but, to shift the thread in the direction of bright-eyed theorizing …
the foundational Calculi of Emergence paper talked about the possibility of hierarchical epsilon machines, where you do epsilon machines on top of epsilon machines and for simple examples where you can analytically do this, you get wild things like coming up with more and more compact representations of stochastic processes (eg data stream → tree → markov model → stack automata → … ?)
this … sounds like natural abstractions in its wildest dreams? literally point at some raw datastream and automatically build hierarchical abstractions that get more compact as you go up
haha but alas, (almost) no development afaik since the original paper. seems cool
and also more tangentially, compmech seemed to have a lot to talk about providing interesting semantics to various information measures aka True Names, so another angle i was interested in was to learn about them.
eg crutchfield talks a lot about developing a right notion of information flow—obvious usefulness in eg formalizing boundaries?
many other information measures from compmech with suggestive semantics—cryptic order? gauge information? synchronization order? check ruro1 and ruro2 for more.
I agree with you.
Epsilon machine (and MSP) construction is most likely computationally intractable [I don’t know an exact statement of such a result in the literature but I suspect it is true] for realistic scenarios.
Scaling an approximate version of epsilon reconstruction therefore seems of prime importance. Real-world architectures and data have highly specific structure & symmetry that make them different from completely generic HMMs. This most likely must be exploited.
The Calculi of Emergence paper has inspired many people but has not been developed much. Many of the details are somewhat obscure and vague. I also believe that most likely completely different methods are needed to push the program further. Computational Mechanics is primarily a theory of hidden Markov models—it doesn’t have the tools to easily describe behaviour higher up the Chomsky hierarchy. I suspect more powerful and sophisticated algebraic, logical and categorical thinking will be needed here. I caveat this by saying that Paul Riechers has pointed out that one can actually understand all these gadgets up the Chomsky hierarchy as infinite HMMs, which may be analyzed usefully just as finite HMMs are.
The still-underdeveloped theory of epsilon transducers I regard as the most promising lens on agent foundations. This is uncharted territory; I suspect the largest impact of computational mechanics will come from this direction.
Your point on True Names is well-taken. More basic examples than gauge information or synchronization order are the triple of quantities: entropy rate h, excess entropy E, and Crutchfield’s statistical/forecasting complexity C. These are the most important quantities to understand for any stochastic process (such as the structure of language and LLMs!).
Pithy sayings are lossily compressed.
Yes.
For example: The common saying, “Anything worth doing is worth doing [well/poorly]” needs more qualifiers. As it is, the opposite respective advice can often be just as useful. I.e., not very.
Better V1: “The cost/utility ratio of beneficial actions at minimum cost is often less favorable than it would be with greater investment.”
Better V2: “If an action is beneficial, a flawed attempt may be preferable to none at all.”
However, these are too wordy to be pithy and in pop culture transmission accuracy is generally sacrificed in favor of catchiness.
[epistemic status: I think I’m mostly right about the main thrust here, but probably some of the specific arguments below are wrong. In the following, I’m much more stating conclusions than providing full arguments. This claim isn’t particularly original to me.]
I’m interested in the following subset of risk from AI:
Early: risk that comes from AIs that are just powerful enough to be extremely useful and dangerous-by-default (i.e. these AIs aren’t wildly superhuman).
Scheming: risk associated with loss of control to AIs, arising from those AIs scheming.
So e.g. I exclude state actors stealing weights in ways that aren’t enabled by the AIs scheming, and I also exclude non-scheming failure modes. IMO, state actors stealing weights is a serious threat, but non-scheming failure modes aren’t (at this level of capability and dignity).
Medium dignity: that is, developers of these AIs are putting a reasonable amount of effort into preventing catastrophic outcomes from their AIs (perhaps they’re spending the equivalent of 10% of their budget on cost-effective measures to prevent catastrophes).
Nearcasted: no substantial fundamental progress on AI safety techniques, no substantial changes in how AI works.
This subset of risk is interesting because I think it’s a natural scenario at which to target technical work on AI safety. (E.g. it’s the main scenario we’re targeting with our AI control agenda.)
I claim that the majority of this risk comes from the lab using AIs internally to do AI development (by which I mean both research and engineering). This is because the AIs doing AI development naturally require access to compute and model weights that they can potentially leverage into causing catastrophic outcomes—in particular, those resources can be abused to run AIs unmonitored.
Using AIs for AI development looks uniquely risky to me among applications of early-transformative AIs, because unlike all other applications I know about:
It’s very expensive to refrain from using AIs for this application.
There’s no simple way to remove affordances from the AI such that it’s very hard for the AI to take a small sequence of actions which plausibly lead quickly to loss of control. In contrast, most other applications of AI probably can be controlled just by restricting their affordances.
If I’m right that the risk from scheming early-transformative models is concentrated onto this pretty specific scenario, it implies a bunch of things:
It implies that work on mitigating these risks should focus on this very specific setting.
It implies that AI control is organizationally simpler, because most applications can be made trivially controlled.
It is pretty plausible to me that AI control is quite easy, because you actually can remove affordances from the AIs that are doing AI R&D such that it’s hard for them to cause problems.
Can you be clearer on this point? To operationalize this, I propose the following question: what fraction of world GDP do you expect will be attributable to AI at the time we have these risky AIs that you are interested in?
For example, are you worried about AIs that will arise when AI is 1-10% of the economy, or more like 50%? 90%?
One operationalization is “these AIs are capable of speeding up ML R&D by 30x with less than a 2x increase in marginal costs”.
As in, if you have a team doing ML research, you can make them 30x faster with only <2x increase in cost by going from not using your powerful AIs to using them.
With these caveats:
The speed up is relative to the current status quo as of GPT-4.
The speed up is ignoring the “speed up” of “having better experiments to do due to access to better models” (so e.g., they would complete a fixed research task faster).
By “capable” of speeding things up this much, I mean that if AIs “wanted” to speed up this task and if we didn’t have any safety precautions slowing things down, we could get these speedups. (Of course, AIs might actively and successfully slow down certain types of research and we might have burdensome safety precautions.)
The 2x increase in marginal cost is ignoring potential inflation in the cost of compute (FLOP/$) and inflation in the cost of wages of ML researchers. Otherwise, I’m uncertain how exactly to model the situation. Maybe increase in wages and decrease in FLOP/$ cancel out? Idk.
It might be important that the speed up is amortized over a longer duration like 6 months to 1 year.
I’m uncertain what the economic impact of such systems will look like. I could imagine either massive (GDP has already grown >4x due to the total effects of AI) or only moderate (AIs haven’t yet been that widely deployed due to inference availability issues, so actual production hasn’t increased that much due to AI (<10%), though markets are pricing in AI being a really, really big deal).
So, it’s hard for me to predict the immediate impact on world GDP. After adaptation and broad deployment, systems of this level would likely have a massive effect on GDP.
I think it depends on how you’re defining an “AI control success”. If success is defined as “we have an early transformative system that does not instantly kill us– we are able to get some value out of it”, then I agree that this seems relatively easy under the assumptions you articulated.
If success is defined as “we have an early transformative system that does not instantly kill us and we have enough time, caution, and organizational adequacy to use that system in ways that get us out of an acute risk period”, then this seems much harder.
The classic race dynamic threat model seems relevant here: Suppose Lab A implements good control techniques on GPT-8, and then it’s trying very hard to get good alignment techniques out of GPT-8 to align a successor GPT-9. However, Lab B was only ~2 months behind, so Lab A feels like it needs to figure all of this out within 2 months. Lab B– either because it’s less cautious or because it feels like it needs to cut corners to catch up– either doesn’t want to implement the control techniques or it’s fine implementing the control techniques but it plans to be less cautious around when we’re ready to scale up to GPT-9.
I think it’s fine to say “the control agenda is valuable even if it doesn’t solve the whole problem, and yes other things will be needed to address race dynamics otherwise you will only be able to control GPT-8 for a small window of time before you are forced to scale up prematurely or hope that your competitor doesn’t cause a catastrophe.” But this has a different vibe than “AI control is quite easy”, even if that statement is technically correct.
(Also, please do point out if there’s some way in which the control agenda “solves” or circumvents this threat model– apologies if you or Ryan has written/spoken about it somewhere that I missed.)
When I said “AI control is easy”, I meant “AI control mitigates most risk arising from human-ish-level schemers directly causing catastrophes”; I wasn’t trying to comment more generally. I agree with your concern.
I didn’t get this from the premises fwiw. Are you saying it’s trivial because “just don’t use your AI to help you design AI” (seems organizationally hard to me), or did you have particular tricks in mind?
The claim is that most applications aren’t internal usage of AI for AI development and thus can be made trivially safe.
Not that most applications of AI for AI development can be made trivially safe.
Pain is the consequence of a perceived reduction in the probability that an agent will achieve its goals.
In biological organisms, physical pain [say, in response to a limb being removed] is an evolutionary consequence of the fact that organisms with the capacity to feel physical pain avoided situations in which the subsystems required for their long-term goals [e.g. locomotion to a favourable position with the limb] were harmed.
This definition applies equally to mental pain [say, the pain felt when being expelled from a group of allies] which impedes long term goals.
This suggests that any system that possesses both a set of goals and the capacity to understand how events influence its probability of achieving such goals should possess a capacity to feel pain. This also suggests that the amount of pain is proportional to the degree of “setbacks” and the degree to which “setbacks” are perceived.
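One way to make this proportionality explicit (a formalization offered only as illustration, not anything the original comment commits to): if P_before is the agent’s estimated probability of achieving its goals before an event and P_after is its perceived probability afterward, then pain ∝ max(0, P_before − P_after), so larger or more fully perceived setbacks produce proportionally more pain.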
I think this is a relatively robust argument for the inherent reality of pain not just in a broad spectrum biological organisms, but also in synthetic [including sufficiently advanced AI] agents.
We should strive to reduce the pain we cause in the agents we interact with.
I think pain is a little bit different than that. It’s the contrast between the current state and the goal state. This contrast motivates the agent to act, when the pain of the contrast becomes bigger than the (predicted) pain of acting.
As a human, you can decrease your pain by thinking that everything will be okay, or you can increase your pain by doubting the process. But it is unlikely that you will allow yourself to stop hurting, because your brain fears that a lack of suffering would result in a lack of progress (some wise people contest this, claiming that wu wei is correct).
Another way you can increase your pain is by focusing more on the goal you want to achieve, sort of irritating/torturing yourself with the fact that the goal isn’t achieved, to which your brain will respond by increasing the pain felt by the contrast, urging action.
Do you see how this differs slightly from your definition? Chronic pain is not a continuous reduction in agency, but a continuous contrast between a bad state and a good state, which makes one feel pain which motivates them to solve it (exercise, surgery, resting, looking for painkillers, etc). This generalizes to other negative feelings, for instance to hunger, which exists with the purpose to be less pleasant than the search for food is, such that you seek food.
I warn you that avoiding negative emotions can lead to stagnation, since suffering leads to growth (unless we start wireheading, and making the avoidance of pain our new goal, because then we might seek hedonic pleasures and intoxicants)
I would certainly agree with part of what you are saying. Especially the point that many important lessons are taught by pain [correct me if this is misinterpreting your comment]. Indeed, as a parent for example, if your goal is for your child to gain the capacity for self sufficiency, a certain amount of painful lessons that reflect the inherent properties of the world are necessary to achieve such a goal.
On the other hand, I do not agree with your framing of pain as being the main motivator [again, correct me if required]. In fact, a wide variety of systems in the brain are concerned with calculating and granting rewards. Perhaps pain and pleasure are the two sides of the same coin, and reward maximisation and regret minimisation are identical. In practice however, I think they often lead to different solutions.
I also do not agree with your interpretation that chronic pain does not reduce agency. For family members of mine suffering from arthritis, their chronic pain renders them unable to do many basic activities, for example, access areas for which you would need to climb stairs. I would like to emphasise that it is not the disease which limits their “degrees of freedom” [at least in the short term], and were they to take a large amount of painkillers, they could temporarily climb stairs again.
Finally, I would suggest that your framing as a “contrast between the current state and the goal state” is basically an alternative way of talking about the transition probability from the current state to the goal state. In my opinion, this suggests that our conceptualisations of pain are overwhelmingly similar.
I think all criticism, all shaming, all guilt tripping, all punishments and rewards directed at children—is for the purpose of driving them to do certain actions. If your children do what you think is right, there’s no need to do much of anything.
A more general and correct statement would be “Pain is for the sake of change, and all change is painful”. But that change is for the sake of actions. I don’t think that’s too much of a simplification to be useful.
I think regret, too, is connected here. And there are certainly times when it seems like pain is the problem rather than an attempt to solve it, but I think that’s a misunderstanding. And while chronic pain does reduce agency, it’s a constant pain and a constant reduction of agency (not cumulative). The pain persists until the problem is solved, even if the problem does not get worse. So if the body is telling the brain “Hey, do something about this, the importance is 50 units of pain”, then you will do anything to solve it as long as there’s a path with less than 50 units of pain which leads to a solution.
The pain does limit agency, but not because it’s a real limitation. It’s an artificial one that the body creates to prevent you from damaging yourself. So all important agency is still possible. If the estimated consequences of avoiding the task are more painful than doing the task, you do it. But again, the body is just estimating the cost/benefit of tasks and choosing the optimal action by making it the least painful one.
My explanation and yours are almost identical, but there are some important differences. In my view,
suffering is good, not bad. I really don’t want humanity to misunderstand this one fact, it has already had profound negative consequences. It’s phantom damage created to avoid real damage. An agent which is unable to feel physical pain and exhaustion would destroy itself, therefore physical pain and exhaustion are valuable and not problems to be solved. Emotions like suffering, exhaustion, annoyance, etc. function the same as physical pain, and once they get over a certain threshold they coerce you into taking an action. Physical pain comes from nerves, but emotional pain comes from your interpretation of reality. Your brain relies on you to tell what ought to be painful (so if you overestimate risk, it just believes you). And you don’t get to choose all your goals yourself, your brain wants you to fulfill your needs (prioritized by the hierarchy of needs). In short, the brain makes inaction painful, while keeping actions that it deems risky painful, and then messes with the weights/thresholds according to need. Just like with hunger (not eating is painful, but if all you have is stale or even moldy bread, then you need to be very hungry before you eat, and you will eat iff pain(hunger)>pain(eating the bread)).
An increase in power/agency feels a lot like happiness though, even according to Nietzsche, whom I’m not confident arguing against, so I get why you’d basically think that the opposite of happiness is the opposite of agency (sorry if this summary does injustice to your point).
How many organisms other than humans have “long term goals”? Doesn’t that require a complex capacity for mental representation of possible future states?
Am I wrong in assuming that the capacity to experience “pain” is independent of an explicit awareness of what possibilities have been shifted as a result of the new sensory data? (i.e. having a limb cleaved from the rest of the body, stubbing your toe in the dark). The organism may not even be aware of those possibilities, only ‘aware’ of pain.
Note: I’m probably just afraid that this sounds all too teleological and personifies evolution
It also suggests that there might be some sort of conservation law for pain for agents.
Conservation of Pain if you will
Check my math: how does Enovid compare to humming?
Nitric Oxide is an antimicrobial and immune booster. Normal nasal nitric oxide is 0.14ppm for women and 0.18ppm for men (sinus levels are 100x higher). journals.sagepub.com/doi/pdf/10.117…
Enovid is a nasal spray that produces NO. I had the damnedest time quantifying Enovid, but this trial registration says 0.11ppm NO/hour. They deliver every 8h and I think that dose is amortized, so the true dose is 0.88ppm. But maybe it’s more complicated. I’ve got an email out to the PI but am not hopeful about a response clinicaltrials.gov/study/NCT05109…
so Enovid increases nasal NO levels somewhere between 75% and 600% compared to baseline, which is not shabby. Except humming increases nasal NO levels by 1500-2000%. atsjournals.org/doi/pdf/10.116….
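One way those endpoints can be reproduced from the numbers above (a sketch; exactly how the 75%-600% range is meant to be computed isn’t spelled out, so treat this as one plausible reading):

```python
# Compare the trial's NO dose to baseline nasal NO, using the figures quoted above.
baseline_women, baseline_men = 0.14, 0.18  # ppm
per_hour = 0.11                            # ppm NO/hour, from the trial registration
amortized = per_hour * 8                   # 0.88 ppm if one dose is spread over 8h

print(per_hour / baseline_women)    # ~0.79x baseline: the low end (~75%)
print(amortized / baseline_women)   # ~6.3x baseline: the high end (~600%)
```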
Enovid stings and humming doesn’t, so it seems like Enovid should have the larger dose. But the spray doesn’t contain NO itself, just compounds that react to form NO. Maybe that’s where the sting comes from? Cystic fibrosis and burn patients are sometimes given stratospheric levels of NO for hours or days; if the burn from Enovid came from the NO itself, then those patients would be in agony.
I’m not finding any data on humming and respiratory infections. Google scholar gives me information on CF and COPD, @Elicit brought me a bunch of studies about honey.
With better keywords, Google Scholar brought me a bunch of descriptions of yogic breathing with no empirical backing.
There are some very circumstantial studies on illness in mouth breathers vs. nasal, but that design has too many confounders for me to take seriously.
Where I’m most likely wrong:
misinterpreted the dosage in the RCT
dosage in RCT is lower than in Enovid
Enovid’s dose per spray is 0.5ml, so pretty close to the new study. But it recommends two sprays per nostril, so real dose is 2x that. Which is still not quite as powerful as a single hum.
I found the gotcha: Enovid has two other mechanisms of action. Someone pointed this out to me on my previous nitric oxide post, but it didn’t quite sink in till I did more reading.
What are the two other mechanisms of action?
citric acid and a polymer
Enovid is also adding NO to the body, whereas humming is pulling it from the sinuses, right? (based on a quick skim of the paper).
I found a consumer FeNO-measuring device for €550. I might be interested in contributing to a replication
I think that’s their guess but they don’t directly check here.
I also suspect that it doesn’t matter very much.
The sinuses have so much NO compared to the nose that this probably doesn’t materially lower sinus concentrations.
the power of humming goes down with each breath but is fully restored in 3 minutes, suggesting that whatever change happens in the sinuses is restored quickly
From my limited understanding of virology and immunology, alternating intensity of NO between sinuses and nose every three minutes is probably better than keeping sinus concentrations high[1]. The first second of NO does the most damage to microbes[2], so alternation isn’t that bad.
I’d love to test this. The device you linked works via the mouth, and we’d need something that works via the nose. From a quick google it does look like it’s the same test, so we’d just need a nasal adaptor.
Other options:
Nnoxx. Consumer skin device, meant for muscle measurements
There are lots of devices for measuring NO concentration in the air; maybe they could be repurposed. Just breathing on one might be enough for useful relative metrics, even if they’re low-precision.
I’m also going to try to talk my asthma specialist into letting me use their oral machine to test my nose under multiple circumstances, but it seems unlikely she’ll go for it.
Obvious question: so why didn’t evolution do that? The ancestral environment didn’t have nearly this disease (or pollution) load. This doesn’t mean I’m right, but it means I’m discounting that specific evolutionary argument.
Although NO is also an immune-system signaling molecule, so the average does matter.
I suspect that in practice many people use the word “prioritize” to mean:
think short-term
only do legible things
remove slack
Mathematical descriptions are powerful because they can be very terse. You can specify only the properties of a system and still get a well-defined system.
This is in contrast to writing algorithms and data structures, where you need concrete implementations of the algorithms and data structures to get a full description.
“Mathematical descriptions” is a little ambiguous. Equations and models are terse. The mapping of such equations to human-level system expectations (anticipated conditional experiences) can require quite a bit of verbosity.
I think that’s what you’re saying with the “algorithms and data structures” part, but I’m unsure if you’re claiming that the property specification of the math is sufficient as a description, and comparable in fidelity to the algorithmic implementation.
The Model-View-Controller architecture is very powerful. It allows us to separate concerns.
For example, if we want to implement an algorithm, we can write down only the data structures and algorithms that are used.
We might want to visualize the steps that the algorithm is performing, but this can be separated from the actual running of the algorithm.
If the algorithm is interactive, then instead of putting the interaction logic in the algorithm, which could be thought of as the rules of the world, we instead implement functionality that directly changes the underlying data that the original algorithm is working on. These could be parameters to the original algorithm, which would modify the runtime behavior (e.g. we could change the maximum search depth for BFS). It could also change the current data the algorithm is working on (e.g. in quicksort we could change the pivot, or smaller_than_list just before they are set). The distinction is somewhat arbitrary. If we were to step through some Python code with a debugger, we could just set any variables in the program.
Usually, people think of something much “narrower” when they think about the Model-View-Controller architecture.
We could also do the same for a mathematical description. We can write down some mathematically well-defined thing and then separately think about how we can visualize this thing. And then again, separately, we can think about how we would interact with this thing.
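As a toy sketch of that separation (my own illustration, not anything from the comment above): the “model” below is plain BFS over a dict-based graph, the “view” is an optional callback that only prints state, and the “controller” is whoever picks the data and parameters, e.g. the maximum search depth.

```python
# Model-View-Controller-style separation for an algorithm.
from collections import deque

def bfs(graph, start, max_depth, on_step=None):          # model: the "rules of the world"
    seen, frontier = {start}, deque([(start, 0)])
    order = []
    while frontier:
        node, depth = frontier.popleft()
        order.append(node)
        if on_step:
            on_step(node, depth, seen)                    # optional hook for the view
        if depth == max_depth:
            continue
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, depth + 1))
    return order

def print_view(node, depth, seen):                        # view: visualization only
    print(f"visiting {node} at depth {depth}, seen so far: {sorted(seen)}")

graph = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
bfs(graph, "a", max_depth=1, on_step=print_view)          # controller: choose data and depth
```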
A very rough draft of a plan to test prophylactics for airborne illnesses.
Start with a potential superspreader event. My ideal is a large conference, many of whom travelled to get there, in enclosed spaces with poor ventilation and air purification, in winter. Ideally >=4 days, so that people infected on day one are infectious while the conference is still running.
Call for sign-ups for testing ahead of time (disclosing all possible substances and side effects). Split volunteers into control and test group. I think you need ~500 sign ups in the winter to make this work.
Splitting controls is probably the hardest part. You’d like the control and treatment group to be identical, but there are a lot of things that affect susceptibility. Age, local vs. air travel, small children vs. not, sleep habits… it’s hard to draw the line
Make it logistically trivial to use the treatment. If it’s lozenges or liquids, put individually packed dosages in every bathroom, with a sign reminding people to use them (color code to direct people to the right basket). If it’s a nasal spray you will need to give everyone their own bottle, but make it trivial to get more if someone loses theirs.
Follow-up a week later, asking if people have gotten sick and when.
If the natural disease load is high enough this should give better data than any paper I’ve found.
Top contenders for this plan:
zinc lozenge
salt water gargle
enovid
betadine gargle
zinc gargle
This sounds like a bad plan because it will be a logistics nightmare (undermining randomization) with high attrition, and extremely high variance due to between-subject design (where subjects differ a ton at baseline, in addition to exposure) on a single occasion with uncontrolled exposures and huge measurement error where only the most extreme infections get reported (sometimes). You’ll probably get non-answers, if you finish at all. The most likely outcome is something goes wrong and the entire effort is wasted.
Since this is a topic which is highly repeatable within-person (and indeed, usually repeats often through a lifetime...), this would make more sense as within-individual and using higher-quality measurements.
One good QS approach would be to exploit the fact that infections, even asymptomatic ones, seem to affect heart rate etc as the body is damaged and begins fighting the infection. HR/HRV is now measurable off the shelf with things like the Apple Watch, AFAIK. So you could recruit a few tech-savvy conference-goers for measurements from a device they already own & wear. This avoids any ‘big bang’ and lets you prototype and tweak on a few people—possibly yourself? - before rolling it out, considerably de-risking it.
There are some people who travel constantly for business and going to conferences, and recruiting and managing a few of them would probably be infinitely easier than 500+ randos (if for no reason other than being frequent flyers they may be quite eager for some prophylactics), and you would probably get far more precise data out of them if they agree to cooperate for a year or so and you get eg 10 conferences/trips out of each of them which you can contrast with their year-round baseline & exposome and measure asymptomatic infections or just overall health/stress. (Remember, variance reduction yields exponential gains in precision or sample-size reduction. It wouldn’t be too hard for 5 or 10 people to beat a single 250vs250 one-off experiment, even if nothing whatsoever goes wrong in the latter. This is a case where a few hours writing simulations to do power analysis on could be very helpful. I bet that the ability to detect asymptomatic cases, and run within-person, will boost statistical power a lot more than you think compared to ad hoc questionnaires emailed afterwards which may go straight to spam...)
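For what it’s worth, here is the kind of quick power simulation being gestured at (all parameters are hypothetical placeholders chosen for illustration, not estimates from this thread), comparing a one-off 250-vs-250 trial with imperfect self-report against pooled repeated trips with near-complete detection:

```python
# Monte Carlo power sketch under made-up parameters: 20% per-event infection risk in
# controls, a hypothetical 40% relative risk reduction, and either 60% self-report
# (one-off survey) or ~100% detection (wearable-based measurement).
import numpy as np
from scipy.stats import fisher_exact

rng = np.random.default_rng(0)

def power(n_ctrl, n_treat, p_ctrl, p_treat, detect, sims=2000, alpha=0.05):
    """Fraction of simulated trials where Fisher's exact test detects the difference."""
    hits = 0
    for _ in range(sims):
        ctrl = rng.binomial(n_ctrl, p_ctrl * detect)
        treat = rng.binomial(n_treat, p_treat * detect)
        _, p = fisher_exact([[ctrl, n_ctrl - ctrl], [treat, n_treat - treat]])
        if p < alpha:
            hits += 1
    return hits / sims

# One-off conference: 250 vs 250 attendees, one exposure each, infections under-reported.
print("between-subject, one event:", power(250, 250, 0.20, 0.12, detect=0.6))
# 10 frequent travelers x 10 trips, alternating treatment, full detection. Pooling like
# this ignores the paired-analysis gain, so it understates the within-person advantage.
print("repeated trips, pooled:   ", power(50, 50, 0.20, 0.12, detect=1.0))
```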
I wonder if you could also measure the viral load as a whole to proxy for the viral exposome through something like a tiny air filter, which can be mailed in for analysis, like the exposometer? Swap out the exposometer each trip and you can measure load as a covariate.
All of the problems you list seem harder with repeated within-person trials.
I don’t really know what people mean when they try to compare “capabilities advancements” to “safety advancements”. In one sense, it’s pretty clear. The common units are “amount of time”, so we should compare the marginal (probabilistic) difference between time-to-alignment and time-to-doom. But I think practically people just look at vibes.
For example, if someone releases a new open source model people say that’s a capabilities advance, and should not have been done. Yet I think there’s a pretty good case that more well-trained open source models are better for time-to-alignment than for time-to-doom, since much alignment work ends up being done with them, and the marginal capabilities advance here is zero. Such work builds on the public state of the art, but not the private state of the art, which is probably far more advanced.
I also don’t often see people making estimates of the time-wise differential impacts here. Maybe people think such things would be exfo/info-hazardous, but nobody even claims to have estimates here when the topic comes up (even in private, though people are glad to talk about their hunches for what AI will look like in 5 years, or the types of advancements necessary for AGI), despite all the work on timelines. It’s difficult to do this for the marginal advance, but not so much for larger research priorities, which are the sorts of things people should be focusing on anyway.
People who have the ability to clarify in any meaningful way will not do so. You are in a biased environment where people who are most willing to publish, because they are most able to convince themselves their research is safe—eg, because they don’t understand in detail how to reason about whether it is or not—are the ones who will do so. Ability to see far enough ahead would of course be expected to be rather rare, and most people who think they can tell the exact path ahead of time don’t have the evidence to back their hunches, even if their hunches are correct, which unless they have a demonstrated track record they probably aren’t. Therefore, whoever is making the most progress on real capabilities insights under the name of alignment will make their advancements and publish them, since they don’t personally see how it’s exfohaz. And it won’t be apparent until afterwards that it was capabilities, not alignment.
So just don’t publish anything, and do your work in private. Email it to anthropic when you know how to create a yellow node. But for god’s sake stop accidentally helping people create green nodes because you can’t see five inches ahead. And don’t send it to a capabilities team before it’s able to guarantee moral alignment hard enough to make a red-proof yellow node!
This seems contrary to how much of science works. I expect if people stopped talking publicly about what they’re working on in alignment, we’d make much less progress, and capabilities would basically run business as usual.
The sort of reasoning you use here, and the fact that my only response to it basically amounts to “well, no, I think you’re wrong; this proposal will slow down alignment too much”, is why I think we need numbers to ground us.
Yeah, I agree that releasing open-weights non-frontier models doesn’t seem like a frontier capabilities advance. It does seem potentially like an open-source capabilities advance.
That can be bad in different ways. Let me pose a couple hypotheticals.
What if frontier models were already capable of causing grave harms to the world if used by bad actors, and it is only the fact that they are kept safety-fine-tuned and restricted behind APIs that is preventing this? In such a case, it’s a dangerous thing to have open-weight models catching up.
What if there is some threshold beyond which a model would be capable of recursive self-improvement, given sufficient scaffolding and unwise pressure from an incautious user? Again, the frontier labs might well abstain from this course. Especially if they weren’t sure they could trust the new model design created by the current AI. They would likely move slowly and cautiously at least. I would not expect this of the open-source community. They seem focused on pushing the boundaries of agent-scaffolding and incautiously exploring whatever they can.
So, as we get closer to danger, open-weight models take on more safety significance.
Yeah, there are reasons for caution. I think it makes sense for those concerned or non-concerned to make numerical forecasts about the costs & benefits of such questions, rather than the current state of everyone just comparing their vibes against each other. This generalizes to other questions, like the benefits of interpretability, advances in safety fine-tuning, deep learning science, and agent foundations.
Obviously such numbers aren’t the end-of-the-line, and like in biorisk, sometimes they themselves should be kept secret. But it still seems a great advance.
If anyone would like to collaborate on such a project, my DMs are open (not to say this topic is covered; this isn’t exactly my main wheelhouse).
it seems to me that disentangling beliefs and values is an important part of being able to understand each other
and using words like “disagree” to mean both “different beliefs” and “different values” is really confusing in that regard
Let’s use “disagree” vs “dislike”.
when potentially ambiguous, I generally just say something like “I have a different model” or “I have different values”
A semi-formalization of shard theory. I think that there is a surprisingly deep link between “the AIs which can be manipulated using steering vectors” and “policies which are made of shards.”[1] In particular, here is a candidate definition of a shard theoretic policy:
By this definition, humans have shards because they can want food at the same time as wanting to see their parents again, and both factors can affect their planning at the same time! The maze-solving policy is made of shards because we found activation directions for two motivational circuits (the cheese direction, and the top-right direction):
On the other hand, AIXI is not a shard theoretic agent because it does not have two motivational circuits which can be activated independently of each other. It’s just maximizing one utility function. A mesa optimizer with a single goal also does not have two motivational circuits which can go on and off in an independent fashion.
This definition also makes obvious the fact that “shards” are a matter of implementation, not of behavior.
It also captures the fact that “shard” definitions are somewhat subjective. In one moment, I might model someone as having a separate “ice cream shard” and “cookie shard”, but in another situation I might choose to model those two circuits as a larger “sweet food shard.”
So I think this captures something important. However, it leaves a few things to be desired:
What, exactly, is a “motivational circuit”? Obvious definitions seem to include every neural network with nonconstant outputs.
Demanding a compositional representation is unrealistic since it ignores superposition. If k dimensions are compositional, then they must be pairwise orthogonal. Then a transformer can only have k ≤ d_model shards, which seems obviously wrong and false.
That said, I still find this definition useful.
I came up with this last summer, but never got around to posting it. Hopefully this is better than nothing.
Shard theory reasoning led me to discover the steering vector technique extremely quickly. This link would explain why shard theory might help discover such a technique.
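For concreteness, a toy sketch (entirely illustrative; this is not the maze-policy code) of what “independently activatable motivational circuits” can look like mechanically: adding a steering vector to a hidden layer shifts the policy’s outputs, and the two directions can be dialed up or down separately.

```python
# Toy illustration: "activate" one motivational circuit by adding a steering vector
# to a hidden layer's activations. The network and vectors are random stand-ins.
import torch
import torch.nn as nn

torch.manual_seed(0)
policy = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))  # toy policy net

cheese_vec = torch.randn(32)     # stand-in for the "cheese" activation direction
top_right_vec = torch.randn(32)  # stand-in for the "top-right" direction

def steer(vec, coeff):
    # Forward hook that adds coeff * vec to the hooked layer's output.
    def hook(module, inputs, output):
        return output + coeff * vec
    return hook

obs = torch.randn(1, 16)
print("baseline:", policy(obs))

# Each direction can be toggled independently of the other.
for name, vec in [("cheese", cheese_vec), ("top-right", top_right_vec)]:
    handle = policy[0].register_forward_hook(steer(vec, coeff=3.0))
    print(f"{name} shard up:", policy(obs))
    handle.remove()
```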
I’m not so sure that shards should be thought of as a matter of implementation. Contextually activated circuits are a different kind of thing from utility function components. The former activate in certain states and bias you towards certain actions, whereas utility function components score outcomes. I think there are at least 3 important parts of this:
A shardful agent can be incoherent due to valuing different things from different states
A shardful agent can be incoherent due to its shards being shallow, caring about actions or proximal effects rather than their ultimate consequences
A shardful agent saves compute by not evaluating the whole utility function
The first two are behavioral. We can say an agent is likely to be shardful if it displays these types of incoherence but not others. Suppose an agent is dynamically inconsistent and we can identify features in the environment like cheese presence that cause its preferences to change, but mostly does not suffer from the Allais paradox, tends to spend resources on actions proportional to their importance for reaching a goal, and otherwise generally behaves rationally. Then we can hypothesize that the agent has some internal motivational structure which can be decomposed into shards. But exactly what motivational structure is very uncertain for humans and future agents. My guess is researchers need to observe models and form good definitions as they go along, and defining a shard agent as having compositionally represented motivators is premature. For now the most important thing is how steerable agents will be, and it is very plausible that we can manipulate motivational features without the features being anything like compositional.
Instead of demanding orthogonal representations, just have them obey the restricted isometry property.
Basically, instead of requiring ∀ i≠j: ⟨x_i, x_j⟩ = 0, we just require ∀ i≠j: |⟨x_i, x_j⟩| ≤ ε.
This would allow a polynomial number of sparse shards while still allowing full recovery.
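A quick numerical illustration of that point (my own, with arbitrary sizes): random unit vectors in a modest dimension already have small pairwise overlaps, so far more than d_model directions can coexist once exact orthogonality is relaxed to |⟨x_i, x_j⟩| ≤ ε.

```python
# Many nearly-orthogonal "shard" directions fit in d dimensions once we only require
# |<x_i, x_j>| <= eps rather than exact orthogonality.
import numpy as np

rng = np.random.default_rng(0)
d, n = 512, 2000                                   # n >> d candidate directions
X = rng.standard_normal((n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)      # normalize to unit vectors

overlaps = np.abs(X @ X.T)
np.fill_diagonal(overlaps, 0.0)
print("max pairwise |<x_i, x_j>|:", overlaps.max())  # typically around 0.2 for these sizes
```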
For illustration, what would be an example of having different shards for “I get food” (F) and “I see my parents again” (P) compared to having one utility distribution over F∧P, F∧¬P, ¬F∧P, ¬F∧¬P?
I think this is also what I was confused about—TurnTrout says that AIXI is not a shard-theoretic agent because it just has one utility function, but typically we imagine that the utility function itself decomposes into parts e.g. +10 utility for ice cream, +5 for cookies, etc. So the difference must not be about the decomposition into parts, but the possibility of independent activation? but what does that mean? Perhaps it means: The shards aren’t always applied, but rather only in some circumstances does the circuitry fire at all, and there are circumstances in which shard A fires without B and vice versa. (Whereas the utility function always adds up cookies and ice cream, even if there are no cookies and ice cream around?) I still feel like I don’t understand this.
@jessicata once wrote “Everyone wants to be a physicalist but no one wants to define physics”. I decided to check the SEP article on physicalism and found that, yep, it doesn’t have a definition of physics:
This surprised me because I have a definition of a physical theory and assumed that everyone else uses the same.
Perhaps my personal definition of physics is inspired by Engels’s “Dialectics of Nature”: “Motion is the mode of existence of matter.” Assuming “matter is described by physics,” we are getting “physics is the science that reduces studied phenomena to motion.” Or, to express it in a more analytical manner, “a physicalist theory is a theory that assumes that everything can be explained by reduction to characteristics of space and its evolution in time.”
For example, “vacuum” is a part of space with a “zero” value in all characteristics. A “particle” is a localized part of space with some non-zero characteristic. A “wave” is part of space with periodic changes of some characteristic in time and/or space. We can abstract away “part of space” from “particle” and start to talk about a particle as a separate entity, and speed of a particle is actually a derivative of spatial characteristic in time, and force is defined as the cause of acceleration, and mass is a measure of resistance to acceleration given the same force, and such-n-such charge is a cause of such-n-such force, and it all unfolds from the structure of various pure spatial characteristics in time.
The tricky part is, “Sure, we live in space and time, so everything that happens is some motion. How to separate physicalist theory from everything else?”
Let’s imagine that we have some kind of “vitalist field.” This field interacts with C, H, O, N atoms and also with molybdenum; it accelerates certain chemical reactions, and if you prepare an Oparin-Haldane soup and radiate it with vitalist particles, you will soon observe autocatalytic cycles resembling hypothetical primordial life. All living organisms utilize vitalist particles in their metabolic pathways, and if you somehow isolate them from an outside source of particles, they’ll die.
Despite having a “vitalist field,” such a world would be pretty much physicalist.
An unphysical vitalist world would look like this: if you have glowing rocks and a pile of organic matter, the organic matter is going to transform into mice. Or frogs. Or mosquitoes. Even if the glowing rocks have a constant glow and the composition of the organic matter is the same and the environment in a radius of a hundred miles is the same, nobody can predict from any observables which kind of complex life is going to emerge. It looks like the glowing rocks have their own will, unquantifiable by any kind of measurement.
The difference is that the “vitalist field” in the second case has its own dynamics not reducible to any spatial characteristics of the “vitalist field”; it has an “inner life.”
I think some of the AI safety policy community has over-indexed on the visual model of the “Overton Window” and under-indexed on alternatives like the “ratchet effect,” “poisoning the well,” “clown attacks,” and other models where proposing radical changes can make you, your allies, and your ideas look unreasonable (edit to add: whereas successfully proposing minor changes achieves hard-to-reverse progress, making ideal policy look more reasonable).
I’m not familiar with a lot of systematic empirical evidence on either side, but it seems to me like the more effective actors in the DC establishment overall are much more in the habit of looking for small wins that are both good in themselves and shrink the size of the ask for their ideal policy than of pushing for their ideal vision and then making concessions. Possibly an ideal ecosystem has both strategies, but it seems possible that at least some versions of “Overton Window-moving” strategies executed in practice have larger negative effects via associating their “side” with unreasonable-sounding ideas in the minds of very bandwidth-constrained policymakers, who strongly lean on signals of credibility and consensus when quickly evaluating policy options, than the positive effects of increasing the odds of ideal policy and improving the framing for non-ideal but pretty good policies.
In theory, the Overton Window model is just a description of what ideas are taken seriously, so it can indeed accommodate backfire effects where you argue for an idea “outside the window” and this actually makes the window narrower. But I think the visual imagery of “windows” actually struggles to accommodate this—when was the last time you tried to open a window and accidentally closed it instead? -- and as a result, people who rely on this model are more likely to underrate these kinds of consequences.
Would be interested in empirical evidence on this question (ideally actual studies from psych, political science, sociology, econ, etc literatures, rather than specific case studies due to reference class tennis type issues).
I’m not a decel, but the way this stuff often is resolved is that there are crazy people that aren’t taken seriously by the managerial class but that are very loud and make obnoxious asks. Think the evangelicals against abortion or the Columbia protestors.
Then there is some elite, part of the managerial class, that makes reasonable policy claims. For abortion, this is Mitch McConnell, being disciplined over a long period of time in choosing the correct judges. For Palestine, this is Blinken and his State Department bureaucracy.
The problem with decels is that theoretically they are part of the managerial class themselves. Or at least, they act like they are. They call themselves rationalists, read Eliezer and Scott Alexander, and whatnot. But the problem is that it’s very hard for an uninterested third party to take these bogus Overton Window claims seriously when they come from people who were supposed to be measured, part of the managerial class.
You need to split. There are the crazy ones that people don’t take seriously, but who will move the managerial class. And there are the serious people that EA money will send to D.C. to work at Blumenthal’s office. This person needs to make small policy requests that will sabotage AI without looking like it. And slowly, you get policy wins and you can sabotage the whole effort.
Agree with lots of this– a few misc thoughts [hastily written]:
I think the Overton Window frame ends up getting people to focus too much on the dimension “how radical is my ask”– in practice, things are usually much more complicated than this. In my opinion, a preferable frame is something like “who is my target audience and what might they find helpful.” If you’re talking to someone who makes it clear that they will not support X, it’s silly to keep on talking about X. But I think the “target audience first” approach ends up helping people reason in a more sophisticated way about what kinds of ideas are worth bringing up. As an example, in my experience so far, many policymakers are curious to learn more about intelligence explosion scenarios and misalignment scenarios (the more “radical” and “speculative” threat models).
I don’t think it’s clear that the more effective actors in DC tend to be those who look for small wins. Outside of the AIS community, there sure do seem to be a lot of successful organizations that take hard-line positions and (presumably) get a lot of their power/influence from the ideological purity that they possess & communicate. Whether or not these organizations end up having more or less influence than the more “centrist” groups is, in my view, not a settled question & probably varies a lot by domain. In AI safety in particular, I think my main claim is something like “pretty much no group– whether radical or centrist– has had tangible wins. When I look at the small set of tangible wins, it seems like the groups involved were across the spectrum of ‘reasonableness’.”
The more I interact with policymakers, the more I’m updating toward something like “poisoning the well doesn’t come from having radical beliefs– poisoning the well comes from lamer things like being dumb or uninformed, wasting peoples’ time, not understanding how the political process works, not having tangible things you want someone to do, explaining ideas poorly, being rude or disrespectful, etc.” I’ve asked ~20-40 policymakers (outside of the AIS bubble) things like “what sorts of things annoy you about meetings” or “what tends to make meetings feel like a waste of your time”, and no one ever says “people come in with ideas that are too radical.” The closest thing I’ve heard is people saying that they dislike it when groups fail to understand why things aren’t able to happen (like, someone comes in thinking their idea is great, but then they fail to understand that their idea needs approval from committee A and appropriations person B and then they’re upset about why things are moving slowly). It seems to me like many policy folks (especially staffers and exec branch subject experts) are genuinely interested in learning more about the beliefs and worldviews that have been prematurely labeled as “radical” or “unreasonable” (or perhaps such labels were appropriate before chatGPT but no longer are).
A reminder that those who are opposed to regulation have strong incentives to make it seem like basically-any-regulation is radical/unreasonable. An extremely common tactic is for industry and its allies to make common-sense regulation seem radical/crazy/authoritarian & argue that actually the people proposing strong policies are just making everyone look bad & argue that actually we should all rally behind [insert thing that isn’t a real policy.] (I admit this argument is a bit general, and indeed I’ve made it before, so I won’t harp on it here. Also I don’t think this is what Trevor is doing– it is indeed possible to raise serious discussions about “poisoning the well” even if one believes that the cultural and economic incentives disproportionately elevate such points).
In the context of AI safety, it seems to me like the most high-influence Overton Window moves have been positive– and in fact I would go as far as to say strongly positive. Examples that come to mind include the CAIS statement, FLI pause letter, Hinton leaving Google, Bengio’s writings/speeches about rogue AI & loss of control, Ian Hogarth’s piece about the race to god-like AI, and even Yudkowsky’s TIME article.
I think some of our judgments here depend on underlying threat models and an underlying sense of optimism vs. pessimism. If one thinks that labs making voluntary agreements/promises and NIST contributing to the development of voluntary standards are quite excellent ways to reduce AI risk, then the groups that have helped make this happen deserve a lot of credit. If one thinks that much more is needed to meaningfully reduce xrisk, then the groups that are raising awareness about the nature of the problem, making high-quality arguments about threat models, and advocating for stronger policies deserve a lot of credit.
I agree that more research on this could be useful. But I think it would be most valuable to focus less on “is X in the Overton Window” and more on “is X written/explained well and does it seem to have clear implications for the target stakeholders?”
Quick reactions:
Re: over-emphasis on “how radical is my ask” vs. “what might my target audience find helpful”, and generally the importance of making your case well regardless of how radical it is: that makes sense. Though notably, the more radical your proposal is (or the more unfamiliar your threat models are), the higher the bar for explaining it well, so these do seem related.
Re: more effective actors looking for small wins, I agree that it’s not clear, but yeah, seems like we are likely to get into some reference class tennis here. “A lot of successful organizations that take hard-line positions and (presumably) get a lot of their power/influence from the ideological purity that they possess & communicate”? Maybe, but I think of like, the agriculture lobby, who just sort of quietly make friends with everybody and keep getting 11-figure subsidies every year, in a way that (I think) resulted more from gradual ratcheting than making a huge ask. “Pretty much no group– whether radical or centrist– has had tangible wins” seems wrong in light of the EU AI Act (where I think both a “radical” FLI and a bunch of non-radical orgs were probably important) and the US executive order (I’m not sure which strategy is best credited there, but I think most people would have counted the policies contained within it as “minor asks” relative to licensing, pausing, etc). But yeah I agree that there are groups along the whole spectrum that probably deserve credit.
Re: poisoning the well, again, radical-ness and being dumb/uninformed are of course separable but the bar rises the more radical you get, in part because more radical policy asks strongly correlate with more complicated procedural asks; tweaking ECRA is both non-radical and procedurally simple, creating a new agency to license training runs is both outside the DC Overton Window and very procedurally complicated.
Re: incentives, I agree that this is a good thing to track, but like, “people who oppose X are incentivized to downplay the reasons to do X” is just a fully general counterargument. Unless you’re talking about financial conflicts of interest, but there are also financial incentives for orgs pursuing a “radical” strategy to downplay boring real-world constraints, as well as social incentives (e.g. on LessWrong IMO) to downplay these boring constraints and cognitive biases against thinking your preferred strategy has big downsides.
I agree that the CAIS statement, Hinton leaving Google, and Bengio and Hogarth’s writing have been great. I think that these are all in a highly distinct category from proposing specific actors take specific radical actions (unless I’m misremembering the Hogarth piece). Yudkowsky’s TIME article, on the other hand, definitely counts as an Overton Window move, and I’m surprised that you think this has had net positive effects. I regularly hear “bombing datacenters” as an example of a clearly extreme policy idea, sometimes in a context that sounds like it maybe made the less-radical idea seem more reasonable, but sometimes as evidence that the “doomers” want to do crazy things and we shouldn’t listen to them, and often as evidence that they are at least socially clumsy, don’t understand how politics works, etc, which is related to the things you list as the stuff that actually poisons the well. (I’m confused about the sign of the FLI letter as we’ve discussed.)
I’m not sure optimism vs pessimism is a crux, except in very short, like, 3-year timelines. It’s true that optimists are more likely to value small wins, so I guess narrowly I agree that a ratchet strategy looks strictly better for optimists, but if you think big radical changes are needed, the question remains of whether you’re more likely to get there via asking for the radical change now or looking for smaller wins to build on over time. If there simply isn’t time to build on these wins, then yes, better to take a 2% shot at the policy that you actually think will work; but even in 5-year timelines I think you’re better positioned to get what you ultimately want by 2029 if you get a little bit of what you want in 2024 and 2026 (ideally while other groups also make clear cases for the threat models and develop the policy asks, etc.). Another piece this overlooks is the information and infrastructure built by the minor policy changes. A big part of the argument for the reporting requirements in the EO was that there is now going to be an office in the US government that is in the business of collecting critical information about frontier AI models and figuring out how to synthesize it to the rest of government, that has the legal authority to do this, and both the office and the legal authority can now be expanded rather than created, and there will now be lots of individuals who are experienced in dealing with this information in the government context, and it will seem natural that the government should know this information. I think if we had only been developing and advocating for ideal policy, this would not have happened (though I imagine that this is not in fact what you’re suggesting the community do!).
It’s not just that problem, though; they will likely be biased to think that their policy is helpful for AI safety at all, and this is a point that sometimes gets forgotten.
But correct on the fact that Akash’s argument is fully general.
Recently, John Wentworth wrote:
And I think this makes sense (e.g. Simler’s Social Status: Down the Rabbit Hole which you’ve probably read), if you define “AI Safety” as “people who think that superintelligence is serious business or will be some day”.
The psych dynamic that I find helpful to point out here is Yud’s Is That Your True Rejection post from ~16 years ago. A person who hears about superintelligence for the first time will often respond to their double-take at the concept by spamming random justifications for why that’s not a problem (which, notably, feels like legitimate reasoning to that person, even though it’s not). An AI-safety-minded person becomes wary of being effectively attacked by high-status people immediately turning into what is basically a weaponized justification machine, and develops a deep drive wanting that not to happen. Then justifications ensue for wanting that to happen less frequently in the world, because deep down humans really don’t want their social status to be put at risk (via denunciation) on a regular basis like that. These sorts of deep drives are pretty opaque to us humans but their real world consequences are very strong.
Something that seems more helpful than playing whack-a-mole whenever this issue comes up is having more people in AI policy putting more time into improving perspective. I don’t see shorter paths to increasing the number of people-prepared-to-handle-unexpected-complexity than giving people a broader and more general thinking capacity for thoughtfully reacting to the sorts of complex curveballs that you get in the real world. Rationalist fiction like HPMOR is great for this, as well as others e.g. Three Worlds Collide, Unsong, Worth the Candle, Worm (list of top rated ones here). With the caveat, of course, that doing well in the real world is less like the bite-sized easy-to-understand events in ratfic, and more like spotting errors in the methodology section of a study or making money playing poker.
I think, given the circumstances, it’s plausibly very valuable e.g. for people already spending much of their free time on social media or watching stuff like The Office, Garfield reruns, WWI and Cold War documentaries, etc, to only spend ~90% as much time doing that and refocusing ~10% to ratfic instead, and maybe see if they can find it in themselves to want to shift more of their leisure time to that sort of passive/ambient/automatic self-improvement productivity.
These are plausible concerns, but I don’t think they match what I see as a longtime DC person.
We know that the legislative branch is less productive in the US than it has been in any modern period, and fewer bills get passed (many different metrics for this, but one is https://www.reuters.com/graphics/USA-CONGRESS/PRODUCTIVITY/egpbabmkwvq/) . Those bills that do get passed tend to be bigger swings as a result—either a) transformative legislation (e.g., Obamacare, Trump tax cuts and COVID super-relief, Biden Inflation Reduction Act and CHIPS) or b) big omnibus “must-pass” bills like FAA reauthorization, into which many small proposals get added in.
I also disagree with the claim that policymakers focus on credibility and consensus generally, except perhaps in the executive branch to some degree. (You want many executive actions to be noncontroversial “faithfully executing the laws” stuff, but I don’t see that as “policymaking” in the sense you describe it.)
In either of those, it seems like the current legislative “meta” favors bigger policy asks, not small wins, and I’m having trouble of thinking of anyone I know who’s impactful in DC who has adopted the opposite strategy. What are examples of the small wins that you’re thinking of as being the current meta?
The FCC just fined US phone carriers for sharing the location data of US customers with anyone willing to buy it. The fines don’t seem to be high enough to deter this kind of behavior.
That likely includes, either directly or indirectly, the Chinese government.
What does the US Congress do to protect against spying by China? Ban TikTok, of course, instead of actually protecting the data of US citizens.
If your threat models include the Chinese government targeting you, assume that they know where your phone is, and shut it off when going somewhere you don’t want the Chinese government (or, for that matter, anyone with a decent amount of capital) to know about.
I don’t have confidence in my models of how coherent and competent governments are at getting and using data like this. The primary buyers of location data are advertisers and business planners looking for statistical correlations for targeting and decisions. This is creepy, but not directly comparable to “targeted by the Chinese government”.
My competing theories of “targeted by the Chinese government” threats are:
they’re hyper-competent and have employees/agents at most carriers who will exfiltrate needed data, so stopping the explicit sale just means it’s less visible.
they’re as bureaucratic and confused as everything else, so even if they know where you are, they’re unable to really do much with it.
I think the tension is what does it even mean to be targeted by a government.
The Office of the Director of National Intelligence wrote a report about this question that was declassified last year. They use the abbreviation CAI for “commercially available information”.
“2.5. (U) Counter-Intelligence Risks in CAI. There is also a growing recognition that CAI, as a generally available resource, offers intelligence benefits to our adversaries, some of which may create counter-intelligence risk for the IC. For example, the January 2021 CSIS report cited above also urges the IC to “test and demonstrate the utility of OSINT and AI in analysis on critical threats, such as the adversary use of AI-enabled capabilities in disinformation and influence operations.”
Last month there was a political fight about warrant requirements for US intelligence agencies’ use of commercially bought data, which was likely partly caused by the concerns from that report.
Here, I mean that you are doing something that’s of interest to Chinese intelligence services. People who want to lobby for Chinese AI policy probably fall under that class.
I’m not sure to what extent people working at top AI labs might be blackmailed by the Chinese government to do things like give them their source code.
[note: I suspect we mostly agree on the impropriety of open selling and dissemination of this data. This is a narrow objection to the IMO hyperbolic focus on government assault risks. ]
I’m unhappy with the phrasing of “targeted by the Chinese government”, which IMO implies violence or other real-world interventions when the major threats are “adversary use of AI-enabled capabilities in disinformation and influence operations.” Thanks for mentioning blackmail—that IS a risk I put in the first category, and presumably becomes more possible with phone location data. I don’t know how much it matters, but there is probably a margin where it does.
I don’t disagree that this purchasable data makes advertising much more effective (in fact, I worked at a company based on this for some time). I only mean to say that “targeting” in the sense of disinformation campaigns is a very different level of threat from “targeting” of individuals for government ops.
Whether or not you face government assault risks depends on what you do. Most people don’t face government assault risks. Some people engage in work or activism that results in them having government assault risks.
The Chinese government has strategic goals and most people are unimportant to those. Some people however work on topics like AI policy in which the Chinese government has an interest.
I feel like this comparison of the enforcement here with the TikTok ban is not directed at the actual primary concern about TikTok, which is content curation by its opaque algorithm, not data privacy per se.
By analogy, if a Soviet state-owned enterprise in 1980 wanted to purchase NBC, would/should we have allowed that? If your answer is “no,” keeping in mind how many people get their news via TikTok, why would/should we allow what effectively seems to be a CCP-(owned or heavily influenced) company to control what content our people see?
Politico wrote, “Perhaps the most pressing concern is around the Chinese government’s potential access to troves of data from TikTok’s millions of users.” The concern that TikTok supposedly is spyware is frequently made in discussions about why it should be banned.
If the main issue is content moderation decisions, the best way to deal with it would be to legislate transparency around content moderation decisions and require TikTok to outsource the moderation decisions to some US contractor.