jacob_cannell

Karma: 6,096

I have a compute-market startup called vast.ai, and I’m working towards aligned AI. Currently seeking networking, collaborators, and hires—especially top notch cuda/gpu programmers.

My personal blog: https://entersingularity.wordpress.com/

jacob_cannell Feb 14, 2025, 2:01 AM
9 points
0
in reply to: gwern’s comment on: Inference cost limits the impact of ever larger models
The effectiveness of weight sharing (and parameter compression in general) diminishes as you move the domain from physics (simple rules/patterns tiled over all of space/time) up to language/knowledge (downstream facts/knowledge that are far too costly to rederive from simulation).

BNNs cant really take advantage of weight sharing so much, so ANNs that are closer to physics should be much smaller parameter wise, for the same compute and capability. Which is what we observer for lower level sensor/motor modalities.

jacob_cannell Feb 14, 2025, 1:52 AM
3 points
−2
in reply to: yo-cuddles’s comment on: The Game Board has been Flipped: Now is a good time to rethink what you’re doing
The single factor prime causative factor driving the explosive growth in AI demand/revenue is and always has been the exponential reduction in $/flop via moore’s law, which simply is jevon’s paradox manifested. With more compute everything is increasingly easy and obvious; even idiots can create AGI with enough compute.

jacob_cannell Jan 16, 2025, 12:09 AM
2 points
0
in reply to: ryan_greenblatt’s comment on: How will we update about scheming?

Abilities/intelligence come almost entirely from pretraining, so all the situation awareness and scheming capability that current (and future similar) frontier models possess is thus also mostly present in the base model.

Yes, but for scheming, we care about whether the AI can self-locate itself as an AI using its knowledge. The fact that (at a minimum) sampling from the system is required for it to self-locate as an AI might make a big difference here.

So if your ‘yes’ above is agreeing that capabilities—including scheming—come mostly from pretraining, then I don’t see how relevant it is whether or not that ability is actually used/executed much in pretraining, as the models we care about will go through post-training and I doubt you are arguing post-training will reliably remove scheming.

I also think it seems probably very hard to train a system capable of obsoleting top human experts which doesn’t understand that it is an AI even if you’re willing to take a big competitiveness hit.

Indeed but that is entirely the point—by construction!

Conceptually we have a recipe R (arch, algorithms, compute, etc), and a training dataset which we can parameterize by time cutoff T. Our objective (for safety research) is not to train a final agent, but instead to find a safe/good R with minimal capability penalty. All important results we care about vary with R independently of T, but competitiveness/dangerousness does vary strongly with T.

Take the same R but vary the time cutoff T of the training dataset: the dangerousness of the AI will depend heavily on T, but not the relative effectiveness of various configurations of R. That is simply a restatement of the ideal requirements for a safe experimental regime. Models/algos that work well for T of 1950 will also work for T of 2020 etc.

jacob_cannell Jan 12, 2025, 10:07 PM
7 points
5
on: How will we update about scheming?
Training processes with varying (apparent) situational awareness
- 1:2.5 The AI seemingly isn’t aware it is an AI except for a small fraction of training which isn’t where much of the capabilities are coming from. For instance, the system is pretrained on next token prediction, our evidence strongly indicates that the system doesn’t know it is an AI when doing next token prediction (which likely requires being confident that it isn’t internally doing a substantial amount of general-purpose thinking about what to think about), and there is only a small RL process which isn’t where much of the capabilities are coming from.
Abilities/intelligence come almost entirely from pretraining, so all the situation awareness and scheming capability that current (and future similar) frontier models possess is thus also mostly present in the base model. The fact that you need to prompt them to summon out a situationally aware scheming agent doesn’t seem like much of a barrier, and indeed strong frontier base models are so obviously misaligned/jail-breakable/dangerous that releasing them to the public is PR-harmful enough to motivate RLHF post training purely for selfish profit-motives.

> This implies that restricting when AIs become (saliently) aware that they are an AI could be a promising intervention, to the extent this is possible without greatly reducing competitiveness.

Who cares if it greatly reduces competitiveness in experimental training runs?

We need to figure out how to align superhuman models—models trained with > 1e25 efficient flops on the current internet/knowledge, which requires experimental iteration. We probably won’t get multiple iteration attempts for aligning SI ‘in prod’, so we need to iterate in simulation (what you now call ‘model organisms’).

We need to find alignment training methods that work even when the agent has superhuman intelligence/inference. But ‘superhuman’ hear is relative—measured against our capabilities. The straightforward easy way to accomplish this is training agents in simulations with much earlier knowledge cutoff dates, which isn’t theoretically hard—just requires constructing augmented historical training datasets. So you could train on a 10T+ token dataset of human writings/thoughts with cutoff 2010, or 1950, or 1700, etc. These base models wouldn’t be capable of simulating/summoning realistic situationally aware agents, their RL derived agents wouldn’t be situationally sim-aware either, etc.

jacob_cannell Jun 8, 2024, 4:44 PM
9 points
1
in reply to: ryan_greenblatt’s comment on: We are headed into an extreme compute overhang
Input vs output tokens are both unique per agent history (prompt + output), so that differentiation doesn’t matter for my core argument about the RAM constraint. If you have a model which needs 1TB of KV cache, and you aren’t magically sharing that significantly between instances, then you’ll need at least 1000 * 1TB of RAM to run 1000 inferences in parallel.

The 3x − 10x cost ratio model providers charge is an economic observation that tells us something about the current cost vs utility tradeoffs, but it’s much complicated by oversimpliciation of the current pricing models (they are not currently charging their true costs, probably because that would be too complicated, but also perhaps reveal too much information—their true cost would be more like charging rent on RAM for every timestep). It just tells you that very roughly, that on average, the mean (averaged over many customer requests) flop utilization of the generation phase (parallel over instances) is perhaps 3x to 10x lower than the prefill phase (parallel over time) - but it doesn’t directly tell you why.

This is all downstream dependent on model design and economics. There are many useful requests that LLMs can fulfill without using barely any KV cache—essentially all google/oracle type use cases where you are just asking the distilled wisdom of the internet a question. If those were all of the request volume, then the KV cache RAM per instance would be inconsequential, inference batch sizes would be > 1000, inference flop utilization would be the same for prefill vs generation, and providers would charge the same price for input vs output tokens.

On the other extreme, if all requests used up the full training context window, then the flop utilization of inference would be constrained by approximately (max_KV_cache_RAM + weight_RAM / max_KV_cache_RAM ) / alu_ratio. For example if the KV cache is 10% of RAM, and alu_ratio is 1000:1, generation would have max efficiency of 1%. If infill efficiency was 30%, then output tokens would presumably be priced 30x more than input tokens.

So the observed input:output token pricing is dependent on the combination of KV_cache RAM fraction (largely a model design decision), current efficiency of implementations of infill vs generation, and most importantly—the distribution of request prompt lengths, which itself is dependent on the current economic utility of shorter vs longer prompts for current models.

In practice most current models have a much smaller KV cache to weight RAM fraction than my simple 1:1 example, but the basic point holds: training is more flop & interconnect limited, inference is more RAM and ram bw limited. But these constraints already shape the design space of models and how they are deployed.

LLMs currently excel at anything a human knowledge worker can do without any specific training (minimal input prompt length), but largely aren’t yet competitive with human experts at most real world economic tasks that require significant unique per-job training. Coding is a good example—human thoughtspeed is roughly 9 token/s, or 32K/hour, or 256K per 8 hour work day, or roughly 1M tokens per week.

Current GPT4-turbo (one of the current leaders for coding), for example, has a max context length of 128K (roughly 4 hours). But if you actually use all of that for each request for typical coding requests that generate say 1K of useful output (equivalent to a few minutes of human thought), that will cost you about $1.25 for the input tokens, but only about $0.03 for the output tokens. That costs about as much as a human worker, per minute of output thought tokens. The cost of any LLM agent today (per minute of output thought) increases linearily with input prompt length—ie the agent’s unique differentiating short term memory. Absent more sophisticated algorithms, the cost of running a react-like LLM agent thus grows quadratically with time, vs linear for humans (because each small observe-act time step has cost proportional to input context length, which grows per time step).

Human programmers aren’t being replaced en masse (yet) in part because current models aren’t especially smarter than humans at equivalent levels of job-specific knowledge/training.

Normalized for similar ability, LLMs currently are cheaper than humans at most any knowledge work that requires very little job-specific knowledge/training, and much more expensive than humans for tasks that require extensive job-specific knowledge/training—and this has everything to do with how transformers currently consume and utilize VRAM.

jacob_cannell Jun 8, 2024, 2:24 AM
1 point
−2
in reply to: devrandom’s comment on: We are headed into an extreme compute overhang
Not for transformers, for which training and inference are fundamentally different.

Transformer training parallelizes over time, but that isn’t feasible for inference. So transformer inference backends have to parallelize over batch/space, just like RNNs, which is enormously less efficient in RAM and RAM bandwidth use.

So if you had a large attention model that uses say 1TB of KV cache (fast weights) and 1TB of slow weights, transformer training can often run near full efficiency, flop limited, parallelizing over time.

But similar full efficient transformer inference would require running about K instances/agents in parallel, where K is the flop/mem_bw ratio (currently up to 1000 on H100). So that would be 1000 * 1TB of RAM for the KV cache (fast weights) as its unique per agent instance.

This, in a nutshell, is part of why we don’t already have AGI. Transformers are super efficient at absorbing book knowledge, but just as inefficient as RNNs at inference (generating new experiences, which is a key bottleneck on learning from experience).

Thus there is of course much research in more efficient long kv cache, tree/graph inference that can share some of the KV cache over similar branching agents, etc

jacob_cannell Jun 5, 2024, 5:45 PM
9 points
0
on: We are headed into an extreme compute overhang
Due to practical reasons, the compute requirements for training LLMs is several orders of magnitude larger than what is required for running a single inference instance. In particular, a single NVIDIA H100 GPU can run inference at a throughput of about 2000 tokens/s, while Meta trained Llama3 70B on a GPU cluster^[1] of about 24,000 GPUs. Assuming we require a performance of 40 tokens/s, the training cluster can run $\frac{2000}{40} \times 24000 = 1, 200, 000$ concurrent instances of the resulting 70B model.

I agree direction-ally with your headline, but your analysis here assumes flops is the primary constraint on inference scaling. Actually it looks like VRAM is already the more important constraint, and would likely become even more dominant if AGI requires more brain-like models.

LLMs need VRAM for both ‘static’ and ‘dynamic’ weights. The static weights are the output of the long training process, and shared over all instances of the same model or fine tune (LORAs share most). However the dynamic ‘weights’ - in the attention KV cache—are essentially unique to each individual instance of the model, specific to its current working memory context and chain of thought.

So the key parameters here are total model size and dynamic vs static ratio (which depends heavily on context length and many other factors). But for example if dynamic is 50% of the RAM usage then 1M concurrent instances would require almost as many GPUs.

If AGI requires scaling up to very large brain-size models ~100T params (which seems likely), and the dynamic ratio is even just 1%, then 1M concurrent instances would require on order 10M GPUs.

jacob_cannell Jan 5, 2024, 5:11 PM
1 point
−3
in reply to: quetzal_rainbow’s comment on: Gentleness and the artificial Other
How is that even remotely relevant? Humans and AIs learn the same way, via language—and its not like this learning process fails just because language undersamples thoughts.

jacob_cannell Jan 4, 2024, 4:59 AM
13 points
2
in reply to: dr_s’s comment on: Gentleness and the artificial Other
As the article points out, shared biological needs do not much deter the bear or chimpanzee from killing you. An AI could be perfectly human—the very opposite of alien—and far more dangerous than Hitler or Dhamer.

The article is well written but dangerously wrong in its core point. AI will be far more human than alien. But alignment/altruism is mostly orthogonal to human vs alien.

jacob_cannell Jan 3, 2024, 11:13 PM
4 points
2
in reply to: quetzal_rainbow’s comment on: Gentleness and the artificial Other

We are definitely not training AIs on human thoughts because language is an expression of thought, not thought itself.

Even if training on language was not equivalent to training on thoughts, that would also apply to humans.

But it also seems false in the same way that “we are definitely not training AI’s on reality because image files are compressed sampled expressions of images, not reality itself” is false.

Approximate bayesian inference (ie DL) can infer the structure of a function through its outputs; the structure of the 3D world through images; and thoughts through language.

jacob_cannell Jan 3, 2024, 1:15 AM
7 points
−21
on: Gentleness and the artificial Other

Premise 1: AGIs would be like a second advanced species on earth, more powerful than humans.

Distinct alien species arise only from distinct separated evolutionary histories. Your example of the aliens from Arrival are indeed a good (hypothetical) example of truly alien minds resulting from a completely independent evolutionary history on an alien world. Any commonalities between us and them would be solely the result of convergent evolutionary features. They would have completely different languages, cultures, etc.

AI is not alien at all, as we literally train AI on human thoughts. As a result we constrain our AI systems profoundly, creating them in our mental image. Any AGI we create will inevitably be far closer to human uploads than alien minds. This a prediction Moravec made as early as 1988 (Mind Children) - now largely fulfilled by the strong circuit convergence/correspondence between modern AI and brains.

Minds are software mental constructs, and alien minds would require alien culture. Instead we are simply creating new hardware for our existing (cultural) mind software.

jacob_cannell Dec 27, 2023, 3:10 AM
11 points
0
in reply to: Thane Ruthenis’s comment on: Current AIs Provide Nearly No Data Relevant to AGI Alignment
I also not sure of the relevance and not following the thread fully, but the summary of that experiment is that it takes some time (measured in nights of sleep which are rough equivalent of big batch training updates) for the newly sighted to develop vision, but less time than infants—presumably because the newly sighted already have full functioning sensor inference world models in another modality that can speed up learning through dense top down priors.

But its way way more than “grok it really fast with just a few examples”—training their new visual systems still takes non-trivial training data & time

jacob_cannell Dec 27, 2023, 3:02 AM
2 points
0
on: Broad Picture of Human Values
I suspect that much of the appeal of shard theory is working through detailed explanations of model-free RL with general value function approximation for people who mostly think of AI in terms of planning/search/consequentialism.

But if you already come from a model-free RL value approx perspective, shard theory seems more natural.

Moment to moment decisions are made based on value-function bids, with little to no direct connection to reward or terminal values. The ‘shards’ are just what learned value-function approximating subcircuits look like in gory detail.

The brain may have a prior towards planning subcircuitry, but even without a strong prior planning submodules will eventually emerge naturally in a model-free RL learning machine of sufficient scale (there is no fundamental difference between model-free and model-based for universal learners). TD like updates ensure that the value function extends over longer timescales as training progresses. (and in general humans seem to plan on timescales which scale with their lifespan, as you’d expect)

jacob_cannell Dec 25, 2023, 8:20 PM
4 points
1
in reply to: Mo Putera’s comment on: Contra Yudkowsky on AI Doom
TSMC 4N is a little over 1e10 transistors/cm^2 for GPUs and roughly 5e^-18 J switch energy assuming dense activity (little dark silicon). The practical transistor density limit with minimal few electron transistors is somewhere around ~5e11 trans/cm^2, but the minimal viable high speed switching energy is around ~2e^-18J. So there is another 1 to 2 OOM further density scaling, but less room for further switching energy reduction. Thus scaling past this point increasingly involves dark silicon or complex expensive cooling and thus diminishing returns either way.

Achieving 1e-15 J/flop seems doable now for low precision flops (fp4, perhaps fp8 with some tricks/tradeoffs); most of the cost is data movement as pulling even a single bit from RAM just 1 cm away costs around 1e-12J.

jacob_cannell Dec 25, 2023, 12:36 AM
3 points
1
in reply to: Mo Putera’s comment on: Contra Yudkowsky on AI Doom
Part of the issue is my post/comment was about moore’s law (transistor density for mass produced nodes), which is a major input to but distinct from flops/$. As I mentioned somewhere, there is still some free optimization energy in extracting more flops/$ at the circuit level even if moore’s law ends. Moore’s law is very specifically about fab efficiency as measured in transistors/cm^2 for large chip runs—not the flops/$ habyrka wanted to bet on. Even when moore’s law is over, I expect some continued progress in flops/$.

All that being said, nvidia’s new flagship GPU everyone is using—the H100 which is replacing the A100 and launched just a bit after habryka proposed the bet—actually offers near zero improvement in flops/$ (the price increased in direct proportion to flops increase). So I probably should have taken the bet if it was narrowly defined as (flops/$ for the flagship gpus most teams using currently for training foundation models).

jacob_cannell Dec 23, 2023, 1:22 AM
13 points
2
on: Idealized Agents Are Approximate Causal Mirrors (+ Radical Optimism on Agent Foundations)
I don’t know who first said it, but the popular saying “Computer vision is the inverse of computer graphics” encompasses much of this viewpoint.

Computer graphics is the study/art of the approximation theory you mention and fairly well developed & understood in terms of how to best simulate worlds & observations in real-time from the perspective of an observer. But of course traditional graphics uses human-designed world models and algorithms.

Diffusion models provide a general framework for learning a generative model in the other direction—in part by inverting trained vision and noise models.

So naturally there is also diffusion planning which is an example of the symmetry you discuss: using general diffusion inference for planning. The graph dimensions end up being both space-time and abstraction level with the latter being more important: sensor inference moves up the abstraction/compression hierarchy, whereas planning/acting/generating moves down.

jacob_cannell Dec 21, 2023, 6:35 PM
5 points
3
on: Prediction Markets aren’t Magic
Even if there is no acceptable way to share the data semi-anonymously outside of match group, the arguments for prediction markets still apply within match group. A well designed prediction market would still be a better way to distribute internal resources and rewards amongst competing data science teams within match group.

But I’m skeptical that the value of match group’s private data is dominant even in the fully private data scenario. People who actually match and meetup with another user will probably have important inside view information inaccessible to the algorithms of match group.

Manifold.Love’s lack of success is hardly much evidence against the utility of prediction markets for dating markets, any more or less than most startup’s failure at X is evidence against the utility of X.

jacob_cannell Dec 19, 2023, 6:20 PM
4 points
0
in reply to: Steven Byrnes’s comment on: [Valence series] 5. “Valence Disorders” in Mental Health & Personality
Certainly mood disorders like bipolar,depression,mania can have multiple causes—for examle simply doing too much dopaminergic simulants (cocaine, meth etc) can cause mania directly.

But the modern increased prevalence of mood disorders is best explained by a modern divergence from conditions in the ancestral environment, and sleep disorder due to electric lighting disrupting circadian rhythms is a good fit to the evidence.

The evidence for each of my main points is fairly substantial and now mainstream, the only part which isn’t mainstream (yet) is the specific causal mechanism linking synaptic pruning/normalization to imbalance in valence computing upper brain modules (but it’s also fairly straightforward obvious from a DL perspective—we know that training stability is an intrinsic likely failure mode).

A few random links:
- wikipedia page on sleep and bipolar
- The role of CLOCK gene in psychiatric disorders: Evidence from human and animal research
REM and synaptic normalization/pruning/homeostasis:
Sleep and Psychiatric Disorders:
- Sleep disturbance and psychiatric disorders : “It is argued that insomnia and other mental health conditions not only share common causes but also show a bidirectional relationship, with typically the strongest pathway being disrupted sleep as a causal factor in the occurrence of other psychiatric problems.”
- Improving sleep quality leads to better mental health: A meta-analysis of randomised controlled trials: “For example, people with insomnia are 10 and 17 times more likely than those without insomnia to experience clinically significant levels of depression and anxiety, respectively”
- The brain structure and genetic mechanisms underlying the nonlinear association between sleep duration, cognition and mental health
The effectiveness of circadian interventions through the blue light pineal gland serotonin->melatonin pathway is also very well established: daytime bright light therapy has long been known to be effective for depression, nighttime blue light reduction is now also recognized as important/effective, etc.

The interventions required to promote healthy sleep architecture are not especially expensive and are certainly not patentable, so they are in a blindspot for our current partially misaligned drug-product focused healthcare system. Of course there would be a market for a hypothetical drug which could target and fix the specific issues that some people have with sleep quality—but instead we just have hammers like benzos and lithium which cause as many or more problems than they solve.

jacob_cannell Dec 19, 2023, 7:11 AM
25 points
1
on: [Valence series] 5. “Valence Disorders” in Mental Health & Personality
From my own study of mood disorders I generally agree with your valence theory of depression/mania.

However I believe the primary cause (at least for most people today) is disrupted sleep architecture.

To a first order approximation, the brain accumulates batch episodic training data during the day through indexing in the hippocampus (which is similar-ish to upper cortex, but more especially adapted to medium term memory & indexing). The brain’s main episodic replay training then occurs during sleep, with alternation of several key phases (REM and several NREM) with unique functional roles. During NREM (SWS in particular) the hippocampus rehearses sequences to ‘train’ the cortex via episodic replay. (Deepmind’s first atari RL agent is based on directly reverse engineering this mechanism).

But the REM sleep is also vitally important—and it seems to globally downscale/prune synaptic connections, most specifically the weakest and least important. It may also be doing something more complex in subtracting out the distribution of internally generated data ala Hinton’s theories (but maybe not, none of his sleep wake algos actually work well yet).

Regardless the brain does not seem to maintain synaptic strength balance on the hourly timescale. Instead median/average synaptic strength slowly grows without bound during the waking state, and is not correctly renormalized until pruning/renormalization during sleep—and REM sleep most specifically.

This explains many curious facts known of mania and depression:
- The oldest known treatment for depression is also completely (but only temporarily) effective: sleep deprivation. Depression generally does not survive sleep deprivation.
- Sleep is likewise effective to treat full blown mania, but mania inhibits sleep. One of the early successes in psychiatry was the use of sedatives to treat severe mania.
- Red light interferes with the circadian rhythm—specifically serotonin->melatonin conversion, and thereby can disrupt sleep architecture (SAD etc)
- SSRIs alter effective serotonin transport quickly but take a week or more to have noticeable effects on mood. Serotonin directly blocks REM—REM sleep is characterized (and probably requires) a near complete absence of monoamine neurotransmitters (histamine, serotonin and norepinephrine).
- Lithium—a common treatment for bipolar—is a strong cellular circadian modulator and sleep stabilizer.
So basically the brain does not maintain perfect homeostatic synaptic normalization balance on short timescales. During wake synapses tend to strengthen, and during REM sleep they are pruned/weakened. Balancing this correctly seems to rely on a fairly complex sleep architecture, disruptions to which can cause mood disorders—not immediately, but over weeks/months.

But why does mean synaptic strength imbalance effect mostly mood and not say vision or motor control? Every synapse and brain region has a characteristic plasticity timescale that varies wildly. Peripheral lower regions (closer to sensors/motors) crystallize early and have low learning rate/plasticity in adults, so they aren’t very susceptible. At any one time in life the hippocampal → cortical episodic replay is focusing on particular brain modules, and in adults that focus is mostly on upper regions (PFC etc) that mostly store current plans, consequences, etc that are changing more rapidly.

Thus the upper brain regions that are proposing and computing the valence of various (actual or mental) actions as ‘dopaminergic bids’ with respect to current plans/situations are the most sensitive to synaptic norm imbalance, because they change at higher frequency. Of course if a manic stays awake long enough they do in fact progress to psychosis similar to schizophrenia.
What links here?
- Morpheus's comment on Morpheus’s Shortform by Morpheus (Oct 21, 2024, 7:54 PM; 1 point)

jacob_cannell Dec 18, 2023, 6:41 PM
2 points
0
in reply to: Thane Ruthenis’s comment on: A Common-Sense Case For Mutually-Misaligned AGIs Allying Against Humans

Sure, but how often do the colonized end up better off for it, especially via trying to employ clever play-both-sides strategies?

I didn’t say the colonized generally ended up better off, but outcomes did vary greatly. Just in the US the cherokees faired much better than say the Susquehannock and Pequot, and if you dig into that history it seems pretty likely that decisions on which colonizer(s) to ally with (british, french, dutch, later american etc) were important, even if not “clever play-both-sides strategies” (although I’d be surprised if that wasn’t also tried somewhere at least once)

jacob_cannell

Training processes with varying (apparent) situational awareness